0% found this document useful (0 votes)

9 views29 pages

Understanding Multi-armed Bandits Concepts

This is notes from a class

Uploaded by

natedaeila

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views29 pages

Understanding Multi-armed Bandits Concepts

This is notes from a class

Uploaded by

natedaeila

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Some Notes on Multi-armed Bandits

Kevin Jamieson, University of Washington

These notes were written for myself to refer to while lecturing. They are not a replacement for
the course textbooks [Lattimore and Szepesvári, 2020, Bubeck et al., 2012] and may contain
errors! I have posted them by request.

1 Introduction
Machine learning, and in particular, supervised learning, is the study of making statistical inferences from
previously collected data. Multi-armed bandits is more about an interaction between an agent (algorithm)
and an environment where one simultaneously collects data and makes inferences in a closed-loop.
You have n “arms” or actions, representing distributions. “Pulling” an arm represents requesting a
sample from that arm.
At each time t = 1, 2, 3, . . .
• Algorithm chooses an action It ∈ {1, . . . , n}
• Observes a reward XIt ,t ∼ PIt where P1 , . . . , Pn are unknown distributions
That is, playing arm i and time s results in a reward Xi,s from the ith distribution. In these lectures, all
distributions will be Gaussian (or sub-Gaussian) with variance 1 unless otherwise specified. Example of
sub-Gaussian distribution is bounded distributions on [−1, 1] or Gaussian N (0, 1). Formally, a distribution
of X is 1-sub-Gaussian if E[exp(λX)] ≤ exp(λ2 /2).
We will find that the means of the distribution are the most pertinent parameters of these distributions.
Let θi∗ = EX∼Pi [X] be the mean of the ith distribution. Define ∆i = maxj=1,...,n θj∗ − θi∗ . We measure
performance of an algorithm in two ways: 1) how much total reward is accumulated, and 2) how many total
pulls are required to identify the best mean.

1.1 Regret Minimization

After T time steps, define the regret as
" T
#
X
RT = max θj∗ T − E XIt ,t
j=1,...,n
t=1

The goal is to have R(T ) = o(T ) to achieve sub-linear regret.

If at time T the ith arm has been played Ti times, then
" T #
X
∗
RT = max θj T − E XIt ,t
j=1,...,n
t=1
T
" n #
X X
= max θj∗ T − E Xi,t 1{It = i}
j=1,...,n
t=1 i=1
n
" T #
X X
= max θj∗ T − θi∗ E 1{It = i}
j=1,...,n
i=1 t=1
Xn
= max θj∗ T − θi∗ E [Ti ]
j=1,...,n
i=1
n
X
= ∆i E [Ti ]
i=1

Thus, we want to minimize the number of times we play sub-optimal arms.

1
1.2 Best-arm identification
Given a δ ∈ (0, 1) identify the best arm with probability at least 1 − δ using as few total pulls as possible.
While related, these objectives are at odds with one another. Sometimes called the (, δ)-PAC setting,
but for simplicity we’ll take = 0.

1.3 Warm-up
Suppose n = 2. How long would it take to decide one arm was better than another using sub-gaussian
bounds? Consider the trivial algorithm:
Input: 2 arms, time τ ∈ N.
Pull each arm i ∈ {1, 2} exactly τ times and compute empirical mean θbi .
For all t > 2τ play arm arg maxi θbi
Proposition 1. Fix , δ. If Z1 , Z2 , . . . are independent mean-zero σ 2 -sub-Gaussian
Pτ random variables so that
E[exp(λZt )] ≤ exp(λ2 σ 2 /2), then for τ = d2σ 2 −2 log(1/δ)e we have P( τ1 t=1 Zt ≤ ) ≥ 1 − δ.
Proof.
τ τ
!
1X X
P( Zt > ) = P(exp λ Zt > exp(λτ ))
τ t=1 t=1
" τ
!#
X
≤ e−λτ E exp λ Zt
t=1
τ
Y
= e−λτ E [exp (λZt )]
t=1
= e−λτ exp λ2 σ 2 τ /2

= exp −λτ + λ2 σ 2 τ /2

≤ exp −τ 2 /2σ 2 ≤ δ

for the chosen τ .

Pτ
Set τ = d8∆−2 log(4/δ)e and let θbi = 1
τ s=1 Xi,s for i = 1, 2. Define the event
( r )
∗ 2 log(4/δ)
Ei := |θbi − θi | ≤ .
τ

Then P(E1c ∪ E2c ) ≤ P(E1c ) + P(E1c ) ≤ δ. Thus, if we pull each arm τ times then on E1 ∩ E2 we have
θb1 > θ1∗ − ∆/2
≥ θ2∗ + ∆/2
> θb2
so that we have determined the best-arm. And we can play it forever.
After any T total plays such that arm i has been played Ti times and T = T1 + T2 , the expected regret
is at most
" T #
X
∗
θ1 T − E XIs ,s = θ1∗ T − E [(T1 θ1∗ + T2 θ2∗ )]
s=1
= E [T2 ∆]
= E [T2 ∆1{E1 ∩ E2 } + T2 ∆1{E1c ∪ E2c }]
≤ E [τ ∆1{E1 ∩ E2 } + T ∆1{E1c ∪ E2c }]
≤ 8∆−1 log(4/δ) + ∆T P(E1c ∪ E2c )
≤ 8∆−1 log(4/δ) + ∆T δ.

2
If we take δ = 1/T then the expected regret is less than ∆ + 8∆−1 log(4T ). On the other hand, the regret
can’t possibly be greater than ∆T , thus the total regret is bounded by
" T #
X
∗
θ1 T − E XIs ,s = min{T ∆, ∆ + 8∆−1 log(4T )}
s=1
p
≤1+2 8T log(4T )
p
where the last step takes the worst case ∆ = 8 log(4T )/T .
Takeaway: For√ very small ∆ we lose almost nothing, for very large ∆ its easy to distinguish, its maximized
at around 1/ T . We’ll see this again.

1.4 Disclaimers and goals of these lectures

To goal of these lectures is to provide an overview of the kinds of problems multi-armed bandits can solve,
and basic algorithmic tropes. By the end of these lectures I hope you will be able to model a real-world
problem as an appropriate bandit problem and propose an algorithm that is a strong baseline.
As a consequence, I will not be stressing important practicalities of algorithms that muck up the analysis.
That is, I will pay for a slightly weaker performing algorithm in exchange for a dramatically simpler analysis.
This means that I will i) assume the horizon times T for regret analyses is known in advance, ii) I will take
liberties with constants and some log factors, making notes of the best known results when pertinent, iii)
will always assume there is a uniquely optimal arm for best-arm identification and aim to find it, not merely
an -good arm, iv) I will only study best-arm identification in the fixed confidence setting (exists a whole
literature on fixed budget setting), v) will assume sub-Gaussian distributions everywhere and nothing more.

2 Action Elimination Algorithm for Multi-armed Bandits

Input: n arms X = {1, . . . , n}, confidence level δ ∈ (0, 1).
Let Xb1 ← X , t ← 1
while |Xb` | > 1 do
t = 2−t
4`2 |X |
Pull each arm in Xb` exactly τ` = d2−2 ` log( δ )e times
Compute the empirical mean of these rewards θbi,` for all i ∈ Xb`

Xb`+1 ← Xb` \ i ∈ Xb` : maxj∈Xb` θbj,` − θbi,` > 2`
t←t+1
Output: Xbt+1

Lemma 1. Assume that maxi∈X ∆i ≤ 4. With probability at least 1−δ, we have 1 ∈ Xb` and maxi∈Xb` ∆i ≤ 8`
for all ` ∈ N.
Proof. For any ` ∈ N and i ∈ [n] define
n o
Ei,` = |θbi,` − θi∗ | ≤ `
T n T∞ q 2 /δ)
and E = i=1 `=1 Ei,` . Noting that ` = 2 log(4n`
τ` we have
∞ ∞
n [
! n X
[ X δ
P(E c ) = P c
Ei,` ≤ ≤ δ.
i=1 `=1 i=1 `=1
2n`2

In what follows assume E holds.

Fix any ` for which 1 ∈ Xb` (note 1 ∈ Xb1 ). Then for any j ∈ Xb` we have

θbj,` − θb1,` = (θbj,` − θj∗ ) − (θb1,` − θ1∗ ) − ∆`

E
≤ 2`

3
which implies 1 ∈ Xb`+1 . Thus, 1 ∈ Xb` for all `. On the other hand, any i for which ∆i = θ1∗ − θi∗ > 4` we
have

max θbj,` − θbi,` ≥ θb1,` − θbi,`

j∈X
b`

= (θbj,` − θj ) − (θbi,` − θi ) − ∆i
> 2` − 4` = 2`

which implies this maxj∈Xb`+1 θj∗ ≥ θ1∗ − 4` = θ1∗ − 8`+1 .

Theorem 1. Assume that maxi∈X ∆i ≤ 4. Then with probability at least 1 − δ, 1 is returned from the
algorithm at a time τ that satisfies
n
X
τ ≤c ∆−2 −2
i log(n log(∆i )/δ)
i=2

Proof. If ∆ = mini6=1 ∆i then Xb` = {1} for t ≥ dlog2 (8∆)e. Note that

dlog2 (8∆)e
X
Ti = τ` 1{i ∈ Xb` }
`=1
dlog2 (8∆)e
X
≤ τ` 1{∆i ≤ 8` }
`=1
dlog2 (8∆−1
i )e
X
= τ`
`=1
dlog2 (8∆−1
i )e
2
4` |X |
X
= d2−2
` log( δ )e
`=1
dlog2 (8∆−1
i )e
4 log22 (16∆−2
i )|X |
X
≤ d2 log( δ )e 4`
`=1
4 log22 (16∆−2
i )|X |
≤ c∆i−2 log( δ )

which implies
Thus, the total number of samples taken before Xb` = {1} is equal to
n n
4 log22 (16∆−2
i )|X |
X X
Ti ≤ c∆−2
i log( δ )
i=1 i=1
Pn
which implies that one can identify the best arm after no more than i=2 ∆−2 −2
i log(n log(∆i )/δ).

Theorem 2. Assume that maxi∈X ∆i ≤ 4. For any T ∈ N, with probability at least 1 − δ

n
−1
c(∆i ∨ ν)−1 log( log((∆i ∨ν) )|X |
X X
Ti ∆i ≤ inf νT + δ ).
ν≥0
i:∆i >0 i=1

Pn
∆−1
p
Moreover, if the algorithm is run with δ = 1/T then RT ≤ c i=2 i log(T ) and RT ≤ c nT log(T ).

4
Suppose you run for T timesteps. Then for any ν ≥ 0 the regret is bounded by:
n
X X
∆i Ti ≤ νT + 8` τ` |Xb` |
i=2 `:8` >ν
dlog2 (8(∆∨ν)−1 )e
X
≤ νT + 8` τ` |Xb` |
`=1
−1
n dlog2 (8(∆∨ν)
X X )e
≤ νT + 8` τ` 1{∆i ≤ 8` }
i=1 `=1
−1
n dlog2 (8(∆
X X i ∨ν) )e
= νT + 8` τ`
i=1 `=1
−1
n dlog2 (8(∆ i ∨ν) )e
2
4` |X |
X X
= νT + 8` d2−2
` log( δ )e
i=1 `=1
n dlog2 (8(∆i ∨ν)−1 )e
4 log22 (8(∆i ∨ν)−2 )|X |
X X
≤ νT + c log( δ ) 2`
i=1 `=1
n
−1
c(∆i ∨ ν)−1 log( log((∆i ∨ν) )|X |
X
≤ νT + δ )
i=1
Pn −1 −1
Setting ν = 0 yields a regret of i=2p ∆i log(n log(∆i )/δ). On the other hand, using ∆i ∨ ν ≥ ν and
minimizing over ν yields a regret of nT log(n log(T )/δ). The expected regret, of course, is then bounded
by
n
" n #
X X
∆i E[Ti ] = E ∆i Ti
i=2 i=2
n
X
≤ ∆−1 −1 c
i log(n log(∆i )/δ) + T P(E )
i=2
Pn
Setting δ = 1/T implies the regret is less than i=2 c∆−1
i log(T ).
Some remarks:
• This analysis doesn’t reuse samples from previous rounds, it is easy to make this change.
• Regret bound requires knowledge of T a priori.

3 Lower bounds for Multi-armed Bandits

Let us briefly pause to consider how far off from optimal we are, and then think about an algorithm that
could get us to optimality. How do we know we’re doing okay?

3.1 Mean of a Gaussian

Suppose
Pn I get n samples from a Gaussian
p distribution N (µ, 1). You compute the empirical mean µ b =
1
n i=1 X i . We know that |b
µ − µ| ≤ 2 log(2/δ)/n. How tight is this? If µ ∈ {0, ∆} then we just need
n = 8∆−2 log(2/δ)1
1 Using the SPRT, as δ → 0 one needs just an expected number of samples equal to 2∆−2 log(2/δ).

5
2 2
1 −(x−µ) /2σ
Let pµ (x) = 2π e be the Gaussian distribution with mean µ. Under H0 , Xi ∼ p0 and under
H1 , Xi ∼ p∆ . Let φ : Rn → {0, ∆}. Then the minimax probability of error is equal to
1
inf max{P0 (φ = 1), P1 (φ = 0)} ≥ inf (P0 (φ = 1), P1 (φ = 0))
φ φ 2
Z Z
1
= inf 1{φ(x) = 1}p0 (x)dx + 1{φ(x) = 0}p1 (x)dx
φ 2 x∈Rn x∈Rn
Z
1
= min{p0 (x), p1 (x)}dx
2 x∈Rn
Z 2
1 p
≥ p0 (x)p1 (x)dx (Cauchy-Schwartz)
4 x∈Rn
Z
1
≥ exp − log( pp10 (x) )p
(x) 1 (x)dx (Jensen’s)
4 x∈Rn

where
Z 2 Z 2
p p
p0 (x)p1 (x)dx = min{p0 (x), p1 (x)} max{p0 (x), p1 (x)}dx
x∈Rn n
Z x∈R Z
≤ min{p0 (x), p1 (x)}dx max{p0 (x), p1 (x)}dx (Cauchy-Schwartz)
x∈Rn x∈Rn
Z
≤2 min{p0 (x), p1 (x)}dx
x∈Rn

and (integrating only over support of pq)

Z 2 Z
p p
p0 (x)p1 (x)dx = exp 2 log( p0 (x) p1 (x)/p0 (x)dx)
x∈Rn x∈Rn
Z
p
≥ exp 2 p0 (x) log( p1 (x)/p0 (x))dx
n
Zx∈R
= exp − log( pp01 (x)
(x)
)p1 (x)dx
x∈Rn

Note that
Z n
! n
Y p1 (xi ) Y
KL(P1 |P0 ) = log p1 (xi )dx
x i=1
p0 (xi ) i=1
= nKL(p1 |p0 ) = n∆2 /2

and that KL(N (0, 1)|N (∆, 1)) = ∆2 /2.

We conclude that
1
exp −n∆2 /2

inf max{P0 (φ = 1), P1 (φ = 0)} ≥
φ 4
Thus, to determine whether or not n samples are from a Gaussian with mean 0 or ∆ with probability of
failure less than δ, one needs n ≥ 2∆−2 log(1/4δ).

3.2 Indentification
An algorithm for best-arm identification at time t is described by given a history (Is , Xs )s<t for each time t
is described by a
• selection rule It ∈ [n] is Ft−1 measurable where Ft = σ(I1 , X1 , I2 , X2 , . . . , It−1 , Xt−1 )
• stopping time τ is Ft measurable, and

6
• recommendation rule bi ∈ [n] invoked at time τ which is Fτ -measurable.
Definition 1. We say that an algorithm for best-arm identification is δ-PAC if for all θ∗ ∈ Rn we have
Pθ∗ (bi = arg maxi∈[n] θi∗ ) ≥ 1 − δ.
The following is due to [Kaufmann et al., 2016], a strengthening of the first time it appeared in [Mannor and Tsitsiklis, 2004
Theorem 3 (Best-arm identification lower bound). Any algorithm Pnthat is δ-PAC on {P : Pi = N (θi , 1), θ1 >
1
maxi6=1 θi , θ ∈ [0, 1]n } for δ < 0.15 satisfies Eθ∗ [τ ] ≥ 2 log( 2.4δ ) i=1 ∆−2
i .

Proof sketch: The original instance has Pi = N (θi∗ , 1). Pick some j ∈ [n] and define an alternative mean
(j) (j)
vector θ(j) ∈ [0, 1]n such that θi = θi∗ if i 6= j and θi = θ1 + for j = i for some arbitrarily small number
(j)
. Note that under θ , arm j is the best arm.
Because the algorithm claims to be δ-PAC, it has to output arm 1 under θ∗ and arm j under θ(j) . But
these two bandit games only differ on arm j so to tell the difference between them its only natural to sample
arm j until one can figure out which instance is being played (i.e., is its mean θj or θ1 + ?) The discussion
above suggests that to make this distinction with probability at least 1 − δ, it is necessary to sample arm j
at least 2(θ1 − θj + )−2 log(1/4δ) times. Taking to zero and noticing that j was arbitrary completes the
sketch.
This is not a proof, however, because the number of times the algorithm samples arm j is random whereas
in the above argument it was fixed. The proof of [Kaufmann et al., 2016] provides convenient tools to prove
general lower bounds for δ-PAC settings.

3.3 Regret, minimax

Theorem p4 (Minimax regret lower bound). For every T ≥ n there exists an instance P = N (θ∗ , I) such
that RT ≥ (n − 1)T /27.
Proof sketch: Let θ∗ = θ = (∆, 0, . . . , 0). For any algorithm, by the pigeon hole principle, there exists an
arm bi ∈ [n] such that E[Tbi ] ≤ T /n.
Define an alternative Gaussian instance with mean vector θ0 that is identical to θ other than θbi = 2∆.
p
If ∆ ≈ n/T then bi will not be given enough samples to distinguish between the two instances, which
means E[T1 ] will be about the same under both models. √
Under θ, if E[T1 ] ≤ T /2 then the regret incurred is at√least ∆T /2 ≈ nT . On the other hand, under θ0 ,
if E[T1 ] > T /2 then the regret again is at least ∆T /2 ≈ nT .
This is not a proof because again the number of times an arm is pulled is random, but as before, these
arguments can be made precise.

3.4 Gap-dependent regret

Lemma 2. Any strategy that satisfies E[Ti (t)] = o(ta ) for any arm i with ∆i > 0 and a ∈ (0, 1), we have
R̄T Pn 2
that limT →∞ inf log(T ) = i=2 ∆i .

Takeaway: This is what his field does: prove an initial upper, then lower, then chase it.

3.5 Revisiting MAB with Optimism

Why go beyond action elimination algorithms? Because they will never hit the asymptotic lower bound, for
one thing, since if we look at when theqsecond to last arm exits, the lowerbounds are the same.
α-UCB which is arg maxi θbi,T (t) + 2α log(t) as α → 1 achieves the lower bound.
i Ti (t)
b1 ≈ µ1 . Minimizing
Any sub-linear regret algorithm plays arm 1 an infinite number of times, so assume µ
the maximum upper bound. Thus, we expect the number of times the ith arm is pulled is 2∆−2 i log(T ),
which is optimal.
UCB1 in its most popular
√ form was developed by [Auer et al., 2002].
MOSS first achieved nT regret [Audibert and Bubeck, 2009].
KL-UCB is finite-time analysis with optimal constants for asymptotic regret [Cappé et al., 2013].

7
The recent work of [Lattimore, 2018] defined a UCB-based algorithm that achieves asymptotic optimal
√
constants, and finite regret bounds of i log(T )
P
∆ −1 and nT .
i

4 Linear Bandits Intro

Now suppose each arm i = 1, . . . , n has a feature vectors xi ∈ Rd . And more over, there exists some θ∗ ∈ Rd
such that a pull of arm It ∈ [n] results in a reward yt = hxIt , θ∗ i + ηt where ηt ∼ N (0, 1).
Applications: Drug-discovery, Spotify, Netflix, ads
In the previous setup, pulling arm i provided no information about arm j, but now suddenly it does.

4.1 Least Squares

Given a sequence of arm choices and observed rewards let {xt , yt , ηt }τt=1 we denote the stacked sequences of
each as X ∈ Rτ ×d , Y ∈ Rτ , and η ∈ Rτ respectively where Y = Xθ∗ + η. Using this information we can
derive a least-squares estimate of θ∗ given as follows
θ̂ = (X T X)−1 X T Y = (X T X)−1 X T (Xθ∗ + η) = θ∗ + (X T X)−1 X T η.
Fix any z ∈ Rd , then Thus
z > (θ̂ − θ∗ ) = z > (X > X)−1 X > η.

Note that η ∼ N (0, I). For any W ∼ N (µ, Σ) we have AW + b ∼ N (Aµ + b, AΣA> ). Thus

z > (θ̂ − θ∗ ) ∼ N (0, z > (X > X)−1 z).

so that
q
>
P z (θ̂ − θ∗ ) ≥ 2z > (X > X)−1 z log(1/δ) ≤ δ.

We will use the notation kzk2A = z > Az so that with probability at least 1 − δ
p
z > (θ̂ − θ∗ ) ≤ kzk(X > X)−1 2 log(1/δ)

4.1.1 Aside: Gaussian to sub-Gaussian

For an arbitrary constant µ,

P (xT (θ̂ − θ∗ ) > µ) = P (wT η > µ)

≤ exp(−λµ)E[exp(λwT η)], let λ > 0 Chernoff Bound
t
X
= exp(−λµ)E[exp(λ wi ηi )]
i=1
t
Y
= exp(−λµ) E[exp(λwi ηi )] independence of wi ηi
i=1
Yt
≤ exp(−λµ) exp(λ2 wi2 /2) sub-Gaussian assumption
i=1
λ2
= exp(−λµ) exp( ||w||22 )
2
µ2 µ
≤ exp(− ) λ=
2||w||22 ||w||22
µ2
= exp(− T T −1 ) = δ,
2x (X X) x

8
where in the final step we made use of the following equality

||w||22 = xT (X T X)−1 X T X(X T X)−1 x = xT (X T X)−1 x.

Thus with probability at least 1 − δ,

r
T 1
x (θ̂ − θ∗ ) ≤ 2xT (X > X)−1 x log( )
δ
p
=: kxk(X > X)−1 2 log(1/δ)

5 Experimental design
Note that if I take measurements (x1 , . . . , xn ) ∈ X and observe their corresponding observations yi =
hxi , θ∗ i + ηi where ηi ∈ 0, ∞, then E[(θb − θ)(θb − θ)> ] = σ 2 (X T X)−1 and also, θb − θ∗ ∼ N (0, σ 2 (X > X)−1 ).
We can visualize this as a confidence ellipsoid for each choice of X. And we can even think of optimizing
1 −x> Σ−1 x/2
the choice. Recall that the PDF of a Gaussian is φ(x) = (2π|Σ|) d/2 e . With entropy 12 log(2πe|Σ|).
When the number of selected points is large, its more convenient to think of sampling n points from a
distribution placed over X . Define
X
Aλ = λx xx>
x∈X

so that for every X ∈ Rτ ×d there exists some λ ∈ 4X such that X > X = >
P
x∈X dλx τ exx = Aλ . This Aλ
can then be used to shape the covariance θ:
b

• A-optimality: minimize fA (λ) = Tr(A−1 2

λ ) minimizes E[kθ − θk2 ]
b

• E-optimality: minimize fE (λ) = maxu:kuk≤1 u> A−1 2

λ u minimizes maxu:kuk≤1 E[(hu, θ − θi) ]
b

• D-optimality: maximize gD (λ) = log(|Aλ |) maximizes the entropy of distribution. Also, if Eλ = {x :

x> A−1
λ x ≤ d} then D-optimality is the minimum volume ellipsoid that contains X .

• G-optimality: minimize fG (λ) = maxx∈X x> A−1 ∗ 2

λ x minimizes maxx∈X E[(hx, θ − θ i) ]
b

Lemma 3 (Kiefer-Wolfowitz (1960)). For any X with d = dim(span(X )), there exists a λ∗ ∈ 4X that
• maxλ gD (λ) = gD (λ∗ )
• minλ fG (λ) = fG (λ∗ )
• fG (λ∗ ) = gD (λ∗ ) = d

• support(λ∗ ) = (d + 1)d/2
Proposition 2. If λ∗ is the G-optimal design for X then if we pull arm x ∈ X exactly dτ λ∗x e times for some
τ > 0 and compute the least squares estimator θ. b Then for each x ∈ X we have with probability at least 1 − δ
p
hx, θb − θ∗ i ≤ kxk(Px∈X dτ λ∗x exx> )−1 2 log(1/δ)
1 p
≤ √ kxk(Px∈X λ∗x xx> )−1 2 log(1/δ)
τ
r
2d log(1/δ)
≤
τ

and we have taken at most τ + d(d+1) pulls. Thus, for any δ 0 ∈ (0, 1) we have P( x∈X {|hx, θb − θ∗ i| >
S
q 2
2d log(2|X |/δ 0 )
τ }) ≤ δ 0 .

9
Notes:
• The support size of (d + 1)d/2 is trivial application of Caratheodory’s theorem. Many algorithms to
find this efficiently.
• Note that one can find a λ∗ with a constant approximation with just support O(d).

• Leverage scores if V -optimality

• John’s ellipsoid is equivalent to G/D-optimality
[Pukelsheim, 2006, Yu et al., 2006]. [Yu et al., 2006, Soare et al., 2014, Soare, 2015, Lattimore and Szepesvari, 2017],

6 Linear Bandits: Regret Minimization

This section is inspired by [Lattimore and Szepesvári, 2020].
Input: Finite set X ⊂ Rd , confidence level δ ∈ (0, 1).
Let Xb1 ← X , ` ← 1
while |Xb` | > 1 do
Let λb` ∈ 4 b be a d(d+1) -sparse minimizer of f (λ) = max kxk2P
X` 2 ( x∈X
λx xx> )−1
x∈X `
b` c

` = 2−` , τ` = 2d−2 2
` log(4` |X |/δ)
Pull arm x ∈ X exactly dλ b`,x τ` e times and construct the least squares estimator θb` using only
the observations of this round
Xb`+1 ← Xb` \ x ∈ Xb` : maxx0 ∈Xb` hx0 − x, θb` i > 2`
`←`+1
Output: Xb`
After T time steps, define the regret as
" T #
X
RT = hx? , θ∗ i − E hxt , θ∗ i
t=1
 
X
= E Tx ∆x 
x6=x?

where ∆x = hx? − x, θ∗ i.

Lemma 4. Assume that maxx∈X hx? − x, θ∗ i ≤ 4. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 8` for all ` ∈ N.
Proof. For any V ⊆ X and x ∈ V define

Ex,` (V) = {|hx, θb` − θ∗ i| ≤ ` }

where it is implicit that θb` is the G-optimal design constructed in the algorithm at stage ` with respect
to Xb` = V. Note that this is precisely the analogous events of multi-armed bandits. The key piece of the

10
analysis is that
   
∞ [
[ ∞
X [
c c
P {Ex,` (Xb` )} ≤ P {Ex,` (Xb` )}
`=1 x∈X
b` `=1 x∈X
b`
∞
!
X X [
c
= P {Ex,` (V)}, Xb` =V
`=1 V⊆X x∈V
∞ X
!
X [
c
= P {Ex,` (V)} P(Xb` = V)
`=1 V⊆X x∈V
X∞ X
δ|V|
≤ 2`2 |X | P(X`
b = V) ≤ δ
`=1 V⊆X

T T∞
Thus, in what follows, assume E := x∈X `=1 {Ex,` (Xb` )} holds.
Fix any ` for which x? ∈ Xb` (note x? ∈ Xb1 ). Then for any x ∈ Xb` we have

hx − x? , θb` i = hx, θb` − θ∗ i − hx? , θb` − θ∗ i + hx − x? , θ∗ i

≤ 2`

which implies x? ∈ Xb`+1 . Thus, x? ∈ Xb` for all `. On the other hand, any x for which hx? − x, θ∗ i > 4` we
have

max hx0 − x, θb` i ≥ hx? − x, θb` i

x0 ∈ X
b`

= hx? , θb` − θ∗ i − hx, θb` − θ∗ i + hx? − x, θ∗ i

> 2`

which implies maxx∈Xb`+1 hx, θ∗ i ≥ hx? , θ∗ i − 4` = hx? , θ∗ i − 8`+1 .

For any ` ≥ dlog2 (8∆−1 )e we have that Xb` = {x? }. Suppose you run for T timesteps. Then for any ν ≥ 0
the regret is bounded by:
dlog2 (8(∆∨ν)−1 )e
X X
∆x Tx = T ν + 8` (|support(λ
b` )| + τ` )
x∈X \x? `=1

dlog2 (8(∆∨ν)−1 )e
X
= Tν + 8` ( (d+1)d
2 + 2d−2 2
` log(4` |X |/δ))
`=1
dlog2 (8(∆∨ν)−1 )e
X
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) −1
)e + 16d−1 2
` log(4` |X |/δ)
`=1
dlog2 (8(∆∨ν)−1 )e
X
−1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 16d log(4 log22 (16(∆ ∨ ν) )|X |/δ) 2`
`=1
−1 −1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 512d(∆ ∨ ν) log(4 log22 (16(∆ ∨ ν) )|X |/δ)

Setting ν = 0 yields a regret bound of O(d∆−1 log(|X | log(∆−1 )/δ)) which implies RT ≤ c ∆
d
log(|X |T ). Mini-
p p
mizing over ν > 0 yields a regret bound of O( dT log(log(T /d)|X |/δ)) which implies RT ≤ c dT log(|X |T ).

Remarks:
• Let X = {ei : i ∈ [d]}. Then for this action set, this bound is nearly minimax according to our lower
bounds!

11
Pd
• However, this is also concerning: we know that in the bandit setting the regret scales like i=2 ∆−1
i log(T )
but this scales d∆−1 log(T ), which is significantly worse. Can we achieve this?
• For pure-exploration, an analogous analysis shows that one can identify the best-arm in ∆d2 log(1/δ)
pulls. But this is exactly the same rate we would have gotten if we did G-optimal once in the beginning
and sample according to that!

• Optimism won’t help here

7 Linear Bandits: Pure exploration

This section is inspired by [Fiez et al., 2019].
Showing that x? is the best arm is equivalent to showing that hx? − x, θ∗ i > 0 for all x ∈ X \ x? . Given
a finite number of observations, we have an estimate θb and a confidence set for θ∗ .
b = hx? − x, θb − θ∗ i + hx? − x, θ∗ i
hx? − x, θi
= hx? − x, θb − θ∗ i + ∆x
p
Recalling above, we have for any vector z ∈ Rd that |hz, θb − θ∗ i| ≤ kzk(X > X)−1 2 log(1/δ) w.p. ≥ 1 − δ.
We need to show that this confidence set is completely inside the x? region. Where we need to decrease
uncertainty is in the directions x − x? , clearly, which is not the G-optimal design. The most realistic
optimization program

ρ? := inf τ
λ∈4X ,τ ∈N

kx? − xk2(P τ λx xx> )−1 1

x∈X
subject to max ≤
x∈X ∆2x 2
kx? − xk2(P λx xx> )−1
x∈X
= inf max
λ∈4X x∈X ∆2x

Once can prove a lower bound of log(1/2.4δ)ρ? .

Input: Finite set X ⊂ Rd , confidence level δ ∈ (0, 1).
Let Xb1 ← X , t ← 1
while |Xb` | > 1 do
Let λb` ∈ 4X be a d(d+1) -sparse minimizer of f (λ; Xb` ) where
2
f (V) = inf f (λ; V) = inf max
0
kx − x0 k2(P λx xx> )−1
λ∈X λ∈X x,x ∈V x∈X

Set ` = 2−` , τ` = 2−2 2

` f (X` ) log(4` |X |/δ)
b
Pull arm x ∈ X exactly dτ` λ`,x e times and construct θb`
b
Xb`+1 ← Xb` \ x ∈ Xb` : maxx0 ∈Xb` hx0 − x, θbt i > `

t←t+1
Output: Xbt+1

Lemma 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 4` for all ` ∈ N.

Proof. For any V ⊆ X and x ∈ V define

Ex,` (V) = {|hx − x? , θb` − θ∗ i| ≤ ` }

12
where it is implicit that θb` is the design constructed in the algorithm at stage ` with respect to Xb` = V.
Given Xb` , with probability at least 1 − 2`2δ|X |
p
|hx − x? , θb` − θ∗ i| ≤ kx − x? k(Px∈V dτ` λ`,x (V)exx> )−1 2 log(4`2 |X |/δ)
kx − x? k(Px∈V λ`,x (V)xx> )−1 p
≤ √ 2 log(4`2 |X |/δ)
τ`
v
u kx − x? k2(P
u
> −1 p
x∈V λ`,x (V)xx )
≤t −2 2
2 log(4`2 |X |/δ)
2` f (V) log(4` |X |/δ)
= `
T∞ T
By exactly the same sequence of steps as above, we have P( `=1 x∈Xb` {|hx − x? , θbt − θ∗ i| > t }) =
T T∞
0
P `=1 Ex,` (X` ) ≥ 1 − δ, so assume these events hold. Consequently, for any x ∈ X`
b b
x∈X

hx0 − x? , θb` i = hx0 − x? , θb` − θ∗ i + hx0 − x? , θ∗ i

≤ hx0 − x? , θb` − θ∗ i
≤ `

so that x? would survive to round ` + 1. And for any x ∈ Xb` such that hx? − x, θ∗ i > 2` we have

max hx0 − x, θb` i ≥ hx? − x, θb` i

x0 ∈X
b`

= hx? − x, θb` − θ∗ i + hx? − x, θ∗ i

> −` + 2`
= `

which implies this x would be kicked out. Note that this implies that maxx∈Xb`+1 hx? −x, θ∗ i ≤ 2` = 4`+1 .

Theorem 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. Then with probability at least 1 − δ, x? is returned from
the algorithm at a time τ that satisfies

τ ≤ cρ? log(∆−1 )[log(1/δ) + log(log(∆−1 )) + log(|X |)].

Proof. Define S` = {x ∈ X : hx? − x, θ∗ i ≤ 4` }. Note that by assumption X = Xb1 = S1 . The above lemma
T∞
implies that with probability at least 1 − δ we have `=1 {Xb` ⊆ S` }. This implies that

f (Xb` ) = min max kx − x0 k2(P λx xx> )−1

λ∈4X x,x0 ∈X
b` x∈X

≤ min max
0
kx − x0 k2(P λx xx> )−1
λ∈4X x,x ∈S` x∈X

= f (S` )

13
For ` ≥ dlog2 (4∆−1 )e we have that S` = {x? }, thus, the sample complexity to identify x? is equal to

dlog2 (4∆−1 )e dlog2 (4∆−1 )e

X X X (d+1)d
dτ` λ
b`,x e =
2 + τ`
`=1 x∈X `=1
dlog2 (4∆−1 )e
X (d+1)d
= 2 + 2−2 b 2
` f (X` ) log(4` |X |/δ)
`=1
dlog2 (4∆−1 )e
(d+1)d
X
≤ 2 dlog2 (4∆−1 )e + 2−2 2
` f (S` ) log(4` |X |/δ)
`=1
dlog2 (4∆−1 )e
4 log22 (8∆−1 )|X |
(d+1)d
X
≤ 2 dlog2 (4∆−1 )e + 4 log( δ ) 22` f (S` ).
`=1

We now note that

kx − x? k2(P λx xx> )−1
ρ? = inf max x∈X

λ∈4X x∈X (hx − x? , θ∗ i)2

kx − x? k2(P λx xx> )−1
x∈X
= inf max max
λ∈4X `≤dlog2 (4∆−1 )e x∈S` (hx − x? , θ∗ i)2
dlog2 (4∆−1 )e kx − x? k2(P
1 X
x∈X λx xx> )−1
≥ inf max
dlog2 (4∆−1 )e λ∈4X x∈S` (hx − x? , θ∗ i)2
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` inf max kx − x? k2(P λx xx> )−1
16dlog2 (4∆−1 )e λ∈4X x∈S` x∈X
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` inf max kx − x0 k2(P λx xx> )−1
64dlog2 (4∆−1 )e λ∈4X x,x0 ∈S` x∈X
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` f (S` )
64dlog2 (4∆−1 )e
`=1

where we have used the fact that maxx,x0 ∈St kx − x0 k2(P ? 2P

> −1 ≤ 4 maxx∈St kx − x k( > −1 by
x∈X λx xx ) x∈X λx xx )
the triangle inequality.

8 Linear bandits: regret minimization revisited

Okay, now that we know how to do optimal pure exploration, how do we turn this into an algorithm that is
optimal? PT
Let RT (X , θ) = Eθ [ t=1 ∆Xt ], ∆x = maxx0 ∈X hx0 − x, θi
The next theorem is from [Lattimore and Szepesvári, 2020].
Theorem 6. Fix any X ⊂ Rd that spans Rd and θ∗ ∈ Rd such that arg maxx∈X hx, θ∗ i is unique. Any policy
(X ,θ ∗ )
for which RT (X , θ∗ ) = o(T a ) for any a > 0 also satisfies lim inf T →∞ RTlog(T ?
) ≥ r where
X
r? := inf αx ∆x
α∈[0,∞)X
x∈X
kx? − xk2(P αx xx> )−1 1
x∈X
subject to max ≤
x∈X ∆2x 2

14
Note that
1X
ρ? := inf αx
α∈[0,∞)X 2
x∈X
kx? − xk2(P αx xx> )−1 1
x∈X
subject to max ≤
x∈X ∆2x 2
Notes
• There exists an asymptotic algorithm [Lattimore and Szepesvari, 2016], but no satisfying finite-time
algorithm as of yet.
• Information directed sampling may be near-optimal and very high performance.

9 Adversarial Linear Bandits

Protocal for Linear Bandits
Input: Time horizon T , action set A ⊂ Rd .
Initialize: Adversary chooses {zt }Tt=1 ⊂ Rd .
for: t = 1, · · · , T
Player chooses action At ∈ A
Player suffers (and observes) loss `(At , zt ) = A>
t zt

9.1 Preliminaries
This presentation of mirror descent follows [Bubeck et al., 2012, Ch. 5].
For any open convex set D ⊂ Rd and its closure denoted D̄, for any Legendre F on D̄ define F ∗ (x) :=
supy∈D̄ x> y − F (y).
Define DF (x, y) = F (x) − F (y) − (x − y)> ∇F (y).
Let the Mirror Descent iterations satisfy, a1 = arg mina∈A F (a) then

at+1 = ∇F ∗ (∇F (at ) − η∇`(at , zt ))

e (1)
at+1 = arg min DF (a, e
at+1 ) (2)
a∈A

where we have assumed the iterates exist.

Theorem 7 (Online Mirror Descent). Let A ⊂ Rd be a closed convex action set, ` a subdifferentiable loss,
and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ dom(∇F (D)) for all (a, z) ∈ A × Z
is satisfied. Then OMD satisfies
T T
X supa∈A F (a) − F (a1 ) 1 X
sup `(at , zt ) − `(a, zt ) ≤ + DF ∗ (∇F (at ) − η∇`(at , zt ), ∇F (at )) .
a∈A t=1 η η t=1

Online Mirror Descent with Linear Losses

Input: Time horizon T , convex action set A ⊂ Rd .
Initialize: Player sets a1 = arg mina∈A F (a). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player suffers (and observes) loss `(at , zt ) = a>
t zt
Player observes zt
Update mirror descent iterates:

at+1 = ∇F ∗ (∇F (at ) − ηzt )

e
at+1 = arg min DF (a, e
at+1 )
a∈A

15
Corollary 1 (Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a closed convex action set,
{zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
T T
X
> supa∈A F (a) − F (a1 ) 1 X
sup (at − a) zt ≤ + DF ∗ (∇F (at ) − ηzt , ∇F (at )) .
a∈A t=1 η η t=1

Stochastic Online Mirror Descent with Linear Losses

Input: Time horizon T , action set A ⊂ Rd .
Initialize: Player sets a1 = arg mina∈A F (a). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T P
Player chooses distribution Pt over A with at = E[At |Pt ] = a∈A aPt (a)
Player samples At from Pt and suffers (and observes) loss `(At , zt ) = A>
t zt
zt |Pt ] = zt
Player computes estimate zbt with E[b
Update mirror descent iterates:

at+1 = ∇F ∗ (∇F (at ) − ηb

e zt )
at+1 = arg min DF (a, e
at+1 )
a∈convhull(A)

Corollary 2 (Stochastic Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a finite action set,
zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
{b
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
" T # T
X
> supa∈A F (a) − F (a1 ) 1 X
sup E (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] .
a∈A t=1
η η t=1

Proof. Applying Corollary 1 with zbt we have

T T
X supa∈A F (a) − F (a1 ) 1 X
sup (at − a)> zbt ≤ + DF ∗ (∇F (at ) − ηb
zt , ∇F (at )) .
a∈A t=1 η η t=1

Taking the expectation on both sides yields the result by noting that
" T # " T # " T # " T # " T #
X X X X X
> > > > >
E zt (At − a) = E zt (at − a) = E E[zt (at − a)|Pt ] = E zt (at − a)|Pt ] = E
E[b zbt (at − a) .
t=1 t=1 t=1 t=1 t=1

9.2 Simplex with unnormalized negative entropy

Pd Pn
Example 1 Let A = {x ∈ Rd : xi ≥ 0, i=1 = 1}, F (x) = i=1 xi log(xi ) − xi with D = (0, ∞)d . F is
Legendre and

[∇F (x)]i = log(xi )

d
X d
X
DF (x, y) = xi log( xyii ) − (xi − yi )
i=1 i=1
d
X
F ∗ (x) = exp(xi )
i=1
[∇F ∗ (x)]i = exp(xi )
d
X
D F∗ (x, y) = exp(yi )(exp(xi − yi ) − 1 − (xi − yi ))
i=1

16
9.2.1 Full information game
Assume the setting of Example 1. The following algorithm implements the updates of Mirror Descent above
for the particular loss `(a, z) = a> z.

Exponential Weights
Input: Time horizon T .
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player chooses at ∈ A
Player suffers (and observes) loss `(at , zt ) = a>
t zt
Player observes zt
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 3 (Exponential weights). Under the conditions of Example 1 with let `(a, z) = a> z, the expo-
nential weights algorithm satisfies
T
X log(d) ηT p
sup `(at , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2

Proof. Note ∇a `(a, z) = z. Plug in quantities of the example to obtain for any a ∈ A
T
X T
X
`(at , zt ) − `(a, zt ) = zt> (at − a)
t=1 t=1
T d
log(d) 1 X X
≤ + at,i (exp(−ηzt,i ) − 1 + ηzt,i )
η η t=1 i=1
T d
log(d) η X X 2
≤ + at,i zt,i
η 2 t=1 i=1
log(d) ηT
≤ +
η 2

where the second line uses F (x) ≤ 0 and F (a1 ) = log(d), the third line uses exp(−x) ≤ 1 − x + 21 x2 for
x ≥ 0, and the last line follows from zt,i ∈ [0, 1] and at is a probability distribution.

9.2.2 Full information, finite action set

Analogous to the categorial weather prediction problem in class, we now consider the case where the player
can only play from a distinct set {1, . . . , d} (i.e., predict rain, snow, sunny). As discussed in class, any
deterministic algorithm will suffer linear regret, so instead, at time t we choose a probability distribution
at ∈ A (in the setting of Example 1), choose distinct action It drawn according to at so that At := eIt , and
Pd
then suffer loss `(At , zt ) = A>
t zt = zt,It . Note that E[At ] = E[eIt ] = i=1 ei at,i = at so that E[`(At , zt )] =
E[A> >
t zt ] = at zt . Thus, the expected regret relative to any probability distribution a ∈ A over distinct items

17
in hindsight is
" T
# " T #
X X
E `(At , zt ) − `(a, zt ) = E zt> (At − a)
t=1 t=1
" T #
X
=E zt> (at − a)
t=1
" T
#
X
=E `(at , zt ) − `(a, zt )
t=1

where the expectation is with to the random selection of each It from at . Alternatively, we can directly
apply Corollary 2 with zbt = zt since we are in the full information setting.

Exponential Weights over finite actions

Input: Time horizon T .
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player chooses at ∈ A
Player draws It ∼ at , sets At = eIt and suffers (and observes) loss `(At , zt ) = A>
t zt =
zt,It
Player observes zt
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 4 (Exponential weights over finite actions). Under the conditions of Example 1 where the player
can only play ei for i ∈ {1, . . . , d} with let `(a, z) = a> z, the exponential weights over finite actions algorithm
satisfies
" T
#
X log(d) ηT p
E sup `(eIt , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2

Proof. Immediate from reduction described above and the previous corollary since the iterates are identical
in the full information game. Due to the oblivious adversary we have
" T # " T
#
X X
sup E `(at , zt ) − `(a, zt ) = E sup `(at , zt ) − `(a, zt )
a∈A t=1 a∈A t=1

Note, as we did in class, one can also prove a high probability bound that would apply to a general reactive
adversary [?, Ch. 2.7]

9.2.3 Bandit feedback over finite action sets

This setting is identical to the previous setting, but now we do not observe the entire vector zt at each time
t, we only observe the element we played zt,It . Using this single value, the player constructs an unbiased
1{It =i}zt,i
estimate of zt with zbt,i = at,i for all i. Note that

1{It = i}zt,i
zt,i |a1 , I1 , . . . , at−1 , It−1 , at ] = E[
E[b |a1 , I1 , . . . , at−1 , It−1 , at ]
at,i
d
X 1{j = i}zt,i
= at,j
j=1
at,i
= zt,i .

18
Pd
Also note that E[At ] = E[eIt ] = i=1 ei at,i = at .

EXP3: Exponential Weights for Exploration Exploitation

Pd
Input: Time horizon T , A = {x ∈ Rd : xi ≥ 0, i=1 = 1}.
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player draws It ∼ at and suffers (and observes) loss `(eIt , zt ) = zt,It
1{It =i}zt,i
Player sets zbt,i = at,i
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zbs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 5 (EXP3). Under the conditions of Example 1 where the player can only play ei for i ∈ {1, . . . , d}
with `(a, z) = a> z and only observe bandit feedback, the EXP3 algorithm satisfies
" T # T
" d #
X log(d) η X X
2
sup E `(At , zt ) − `(a, zt ) ≤ + E at,i zbt,i
a∈A t=1
η 2 t=1 i=1
log(d) ηT d p
≤ + ≤ 2dT log(d)
η 2
Proof. We can directly apply Corollary 2:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb zt , ∇F (at ))]
a∈A t=1 η η t=1
T
" d #
log(d) 1 X X
= + E at,i (exp(−ηb zt,i ) − 1 + ηb
zt,i )
η η t=1 i=1
T
" d #
log(d) η X X
2
≤ + E at,i zbt,i
η 2 t=1 i=1
T
" d #
2
log(d) η X X 1{It = i}zt,i
= + E at,i
η 2 t=1 i=1
a2t,i
T d
log(d) η X X 2
= + z
η 2 t=1 i=1 t,i
log(d) ηdT
≤ +
η 2

For an alternative proof of EXP3, see [Lattimore and Szepesvári, 2020, Ch. 11]

9.3 Other action sets

The previous section addressed the case of the action set being equal to the simplex: A = {x ∈ Rd : xi ≥
Pd Pd
0, i=1 = 1}. As our Legendre potential we chose unnormalize negative entropy F (x) = i=1 xi log(xi )−xi .
Consider what the guarantee would be from Corollary 2 if we chose a different function, say, F (x) = 12 kxk22

19
then:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))]
a∈A t=1 η η t=1
T
1 ηX
zt k22

≤ + E kb
η 2 t=1
T
" d #
1 ηX X
2
= + E zbt,i
η 2 t=1 i=1
T
" d #
X zt,i 2
1 ηX
= + E .
η 2 t=1 a
i=1 t,i

The issue here is that at,i can become arbitrarily close to 0 and blow up the bound. If we mix in unfirom
2/3
exploration
p at each round, one can show that the regret bound is O((dT ) ) which is significantly worse
than O( dT log(d)) of EXP3 above. So given an action set A how do we choose F ? The next proposition
sheds some light on this question.
Proposition 3. If F is twice continuously differntiable, and if its Hessian ∇2 F (x) is invertible ∀x ∈ D,
then ∀x, y ∈ D, there exists ζ ∈ D such that ∇F (ζ) ∈ [∇F (x), ∇F (y)] and
1
DF ∗ (∇F (x), ∇F (y)) = k∇F (x) − ∇F (y)k2(∇2 F (ζ))−1 .
2
The implication of the above proposition is that ∃∇F (ζt ) ∈ [∇F (at ) − ηb
zt , ∇F (at )] and

η2
DF ∗ (∇F (at ) − ηb
zt , ∇F (at )) = zt k2(∇2 F (ζt ))−1
kb
2
For the choice of F (x) = 12 kxk22 we have ∇2 F (ζt ) = I for any ζt so that the Hessian is flat across the action
Pd
set. On the other hand, with the choice F (x) = i=1 xi log(xi )−xi , we have ∇2 F (x) = diag(1/x1 , . . . , 1/xd )
which blows up as a component of x approaches 0. But this is perfect, since then
2
η 2
E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] = E kb
zt k(∇2 F (ζt ))−1
2
" d
#
η2 X 2
=E zb ζt,i
2 i=1 t,i
" d
#
2
η 2 X zt,i
=E ζt,i
2 i=1 at,i
d
η2 X 2
≈ z
2 i=1 t,i

where we have used the approximation that ζt ≈ at . The hessian of F is blowing up precisely at the locations
where zbt blows up, essentially cancelling each other! We’ll see another example of this in the next subsection.

9.3.1 Unit ball action set

Here we address the action set A = {x ∈ Rd : kxk2 ≤ 1}. To use Corollary 2, we need to define Pt (or
equivalently, At ) to make sure that E[At ] = at and we need to define zbt with E[b
zt ] = zt . Consider the
following choices:
• ξt ∼ Bernoulli(kat k2 ), It ∼ uniform([d]), t ∈ {−1, 1} with equal probability
• At = (1 − ξt )t eIt + ξt kaattk2

20
d
• zbt = (1 − ξt ) 1−ka t k2
At A>
t zt

It is straightforward to verify that E[Ath|at ] = at and E[b zti|at ] = zt . Following the intuition of the previous
η2 2
section, we need to choose F to make E 2 kb zt k(∇2 F (ζt ))−1 small. Let F (x) = − log(1 − kxk2 ) − kxk2 . Note
that

h
2
i d > 2
E kbzt k(∇2 F (ζt ))−1 = E k(1 − ξt ) At At zt k(∇2 F (ζt ))−1
1 − kat k2
d2

> 2
=E kAt At zt k(∇2 F (ζt ))−1 |ξt = 0
1 − kat k2
d
X 1 d2
= kei e> 2
i zt k(∇2 F (ζt ))−1
i=1
d 1 − kat k2
d
= kzt k2(∇2 F (ζt ))−1
1 − kat k2
d
≤ (1 − kζt k2 )kzt k22
1 − kat k2
≤ 2d
1−kzt k2
where we have used the fact that ∇2 F (x) I/(1−kxk2 ) and 1−ka t k2
≤ 2 (see [Lattimore and Szepesvári, 2020]
for second fact). √
Using a shrunken action set, one can prove that the regret is bounded by O( dT ) (see [Lattimore and Szepesvári, 2020]).

9.3.2 Finite action sets

Here we study the case when |A| = n and A = {x1 , . . . , xn } ⊂ Rd and maxx∈A maxt |x> zt | ≤ 1. Our
approach will rely on a slight modification of the EXP3 algorithm:

EXP3(γ): Exponential Weights for Exploration Exploitation

Input: Time horizon T , n arms, η > 0, γ ∈ [0, 1], λ ∈ 4n .
Initialize: Player sets p1 = (1/n, . . . , 1/n) ∈ 4n . Adversary chooses {yt }Tt=1 ⊂ [−1, 1]n .
for: t = 1, · · · , T
Player draws It ∼ qt := (1 − γ)pt + γλ and suffers (and observes) loss `(It , yt ) = yt,It
Player computes ybt,i where E[byt,i |pt ] = yt,i
Update iterates:
t
X n
X
pet+1,i = exp(−η ybs,i ) pt,i = pet+1,i / pet+1,j .
s=1 j=1

A straightforward modification for the stochastic mirror descent proof accounts for the forced exploration:
Proposition 4 (EXP3(γ)). The regret of EXP3(γ) algorithm satisfies
" T # T
" n #
X log(n) 1X X
max E yt,It − yt,i ≤ + 2γT + E qt,i φ(−ηb
yt,i )
i∈[n]
t=1
η η t=1 i=1

where φ(x) = ex − 1 − x ≤ x2 for |x| ≤ 1.

We will use the EXP3(γ) algorithm as follows:
Pn
• Set At = xIt , Qt = i=1 qt,i xi x> bt = Q−1
i , z
>
t At At zt

• yt,i = x> bt,i = x>

i zt , y i z
bt
to obtain the following theorem:

21
Theorem 8. Let |A| = n and A = {x1 , . . . , xn } ⊂ Rd and maxx∈A maxt |x> zt | ≤ 1. Using the reduction to
EXP3(γ) with γ = ηd we have
" T #
X
> log(n) p
max E (At − a) zt ≤ + 3ηdT ≤ 3dT log(n)
a∈A
t=1
η

First note that

yt,i |pt ] = E[x>

E[b bt |pt ]
i z
−1
= E[x> >
i Qt At At zt |pt ]
−1
= x> >
i Qt E[At At |pt ]zt

= x>
i zt = yt,i

We also have
2
yt,i
E[b ] = E[(x> bt )2 |pt ]
i z
−1 > −1
= E[x> > 2
i Qt At At Qt xi (At zt ) |pt ]
−1
≤ x>
i Qt xi · h
2

so that
" n
# n
X X
−1
E 2
qt,i ybt,i ≤ h2 qt,i x>
i Qt xi
i=1 i=1
n
X
−1
≤ h2 Trace(qt,i xi x>
i Qt )
i=1
≤ h Trace(Qt Q−1
2
t ) = dL
2

Finally,

yt,i | = |ηx>
|ηb bt |
i z
−1
= |ηx> >
i Qt At At zt |
−1
≤ ηh|x>
i Qt At |
−1
= ηh|x>
i Qt xIt |
≤ ηhkxi kQ−1 kxIt kQ−1
t t

≤ ηhkxi k(γ Pni=1 λi xi x>

i )
−1 kxIt k(γ
Pn > −1
i=1 λi xi xi )

ηhd
≤
γ
so it suffices to take γ = ηhd.

10 Contextual Bandits
This section is inspired by [Lattimore and Szepesvári, 2020]
For t = 1, 2, . . .
iid
• Nature reveals ct ∼ D

• Player chooses xt ∈ X and observes yt = v(ct , xt ) + t

22
which models users showing up to websites, or patients showing up to the doctor with different symptoms.
A policy π maps contexts to actions in [n] and the value of a policy π is defiend as
V (π) = EC, [v(C, π(C)) + ] = EC [v(C, π(C))]
At each time, we assume the action taken is according to some policy πt ∈ Π so that the regret is defined as
XT
RT = T · V (π ? ) − E[ V (πt )]
t=1

Notes
• This is not the only way we could have defined regret. Perhaps an even more natural approach would
be to define the optimal policy wrt the actual rewards, now the expected rewards.
One naive way to handle this is to run a different linear bandit for each ct , but this is impractical for
infinite sets of ct .

10.1 Stochastic Linear model

Suppose v(c, x) = hφ(c, x), θ∗ i. This can be learned through historical data (e.g., a deep network), or through
polynomials of inputs. We can restate the above models as For t = 1, 2, . . .
• Nature reveals (xt,1 , . . . , xt,n ) = Xt ⊂ Rd
• Player chooses It ∈ [n] and observes yt = hxt,It , θ∗ i + t
When we had a fixed action set, we built confidence intervals on hxi , θb − θ∗ i. Now that we don’t
know what action sets to expect, a natural to assume maxi,t kxi,t k ≤ 1 and build confidence intervals
on supu:kuk2 ≤1 hu, θb − θ∗ i = kθb − θ∗ k2 , or equivalently, define a set Ct with the guarantee that θ∗ ∈ Ct
for all t. When an action set Xt shows up, we could eliminate all provably sub-optimal arms by setting
Xbt = X \ {x : maxx0 ∈X hx0 − x, θi < 0 ∀θ ∈ Ct }. An alternative is to run UCB, defining:
U CBt (x) = maxhx, θi
θ∈Ct

and play xt = arg maxx∈Xt U CBt (x). If x?t = arg maxx∈Xt hx, θ∗ i then

hx?t , θ∗ i ≤ U CBt (x?t ) ≤ U CBt (xt ) = hxt , θi.

Thus, the instantaneous regret at time t satisfies

rt = hx?t − xt , θ∗ i
≤ hxt , θe − θ∗ i
≤ kxt kA−1 kθe − θ∗ kAt−1
t−1
p
≤ 2kxt kA−1 βt−1
t−1

Thus, the random regret satisfies

v v
T
X
u T
u X
u
u T
X
R
bT = rt ≤ T
t 2
rt ≈ t2T βT kxt k2A−1
t−1
t=1 t=1 t=1

Let θ̂t be the `2 -regularized least-squares estimate of θ∗ with regularization parameter λ > 0 given by
θ̂t = arg min kX1:t θ − Y1:t k + λkθk22 = (XT1:t X1:t + λI)−1 XT1:t Y1:t
θ

where we are denoting X1:t as a matrix with rows X1T , X2T , . . . , XtT and Y1:t as the vector (Y1 , . . . , Yt )T . The
following theorem says that with high probability θ∗ lies with high probability in an ellipsoid with center at
θ̂t .

23
Theorem 9. Confidence Ellipsoid. Assume the same as in Theorem ??, let V = Iλ, λ > 0, define Yt =
hXt , θt i + ηt and assume that kθ∗ k ≤ S. Then for any δ > 0, with probability at least 1 − δ, for all t ≥ 0, θ∗
lies in the set
 s 
 det(V ) 1/2 ) det(λI)−1/2 
t
Ct = θ ∈ Rd : kθ̂ − θkV t ≤ R 2 log( ) + λ1/2 S .
 δ 

Furthermore, if for all t ≥ 1, kXt k2 ≤ L then with probability at least 1 − δ, for all t ≥ 0, θ∗ lies in the set
( r )
0 d 1 + tL2 /λ 1/2
Ct = θ ∈ R : kθ̂ − θkV t ≤ R d log( )+λ S .
δ

10.2 Stochastic Contextual Bandits for General policy classes

This section is inspired by Alekh’s Agarwal’s excellent notes on contextual bandits. See [Link]
[Link]/courses/cse599m/19sp/.
Let’s return to the general setting of trying to minimize V (π ? ) − V (π). The core diffuclty of contextual
bandits is that if I take action i and receive a reward with mean v(ct , i), I don’t observe v(ct , j) for some
j 6= i. If I don’t see that context ever again, how can I know what I should have done?

10.2.1 τ -greedy
Here we explore a simple strategy that explores for τ steps then exploits or the last T − τ steps. For any
context c fix an exploration distribution µ(x|c) ∈ 4n such that µ(x|c) > 0 for all x, c.
Then suppose at time t, I observed some context ct and played xt ∼ µ(·|c) to receive reward rt =
v(ct , xt ) + t . Then define the inverse propensity scoring estimator as
1{xt = x}
vb(ct , x) = rt
µ(x|ct )
Note that

1{xt = x}
v (ct , x)|ct ] = E rt
E[b |ct
µ(x|ct )

X 1{xt = x}
= µ(x0 |ct )E rt |ct , xt = x0
0
µ(x|ct )
x ∈X
X 1{x0 = x}
= µ(x0 |ct )v(ct , x0 )
µ(x|ct )
x0 ∈X
= v(ct , x)
with variance
v (ct , x) − v(ct , x))2 |ct ] ≤ E[(b
E[(b v (ct , x))2 |ct ]

1{xt = x}
≤E |c t (rt ≤ 1)
µ(x|ct )2
X 1{x0 = x}
≤ µ(x0 |ct )
0
µ(x|ct )2
x ∈X
1
=
µ(x|ct )
1
Pt
and trivially, vb(c, x) ≤ vbmax := maxc,x µ(x|c) . Finally, if Vbt (π) = 1s s=1 vb(ct , π(ct )) then E[Vbt (π)] = V (π)
for any π ∈ Π. By Bernstein’s inequality we have with probability at least 1 − δ
s
2 log(2/δ) 1 2b
vmax log(2/δ)
|V (π) − Vt (π)| ≤
b Ec,x∼µ +
t µ(π(c)|c) 3t

V (b π ) − Vbt (b
π ) = V (b π ) − Vbt (π ? ) + Vbt (π ? ) − V (π ? ) +V (π ? )
π ) + Vbt (b
| {z } | {z } | {z }
≥−t ≥0 ≥−t
?
≥ V (π ) − 2t

If we explore uniformly for τ rounds than exploit or T − τ rounds then we achieve a regret of at most
r
˙ − τ ) ≤ τ + T 4n log(2|Π|/δ)
1 · t + 2τ (T
τ
which is minimized at τ = (nT 2 log(2|Π|/δ))1/3 which yields a regret of O(T 2/3 (n log(2|Π|/δ))1/3 ) regret.
1
Pm
Lemma 6 (Bernstein’s inequality). Let X1 , . . . , Xm be independent random variables such that m i=1 E[(Xi −
E[Xi ])2 ] ≤ σ 2 and |Xi | ≤ B. Then
m
r
1 X 2σ 2 log(2/δ) 2B log(2/δ)
Xi ≤ +
m i=1 m 3m

with probability at least 1 − δ.

10.2.2 Action Elimination

We will make the strong assumption that the distribution of contexts is known a priori. Not so implausible
due to historical data.
Recall the value of a policy π ∈ Π as V (π) = Ec [v(c, π(c))]. Taking our G-optimality approach, we wish
to sequentially define probability vectors at each time and play probabilistically following

min max E[(Vbt (π) − V (π))2 ]

λ∈∆Π
b π∈Π
b

b ⊂ Π.
for some active set Π
Input: Policy set Π such that π : X → [n] for all π ∈ Π, confidence level δ ∈ (0, 1).
Let Πb 1 ← Π, ` ← 1
while |Π b ` | > 1 do
q
log(2|Π|T /δ)
` = 2−` , τ` = d16n−2 ` log(2|Π|T /δ)e, γ` = min{ 2n ,
1
9nτ`
}
h i
1
Q` = arg minQ∈4Πb maxπ∈Π b ` EC Qγ (π(C)|C)
`
γ P
s.t. Q (x|c) = γ + (1 − γn) π∈Π b ` :π(c)=x Q(π)
for t = T`−1 + 1, . . . , T`
Observe context ct
Play xt ∼ Qγ (·|ct ), set pt = Qγ (xt |ct ) and observe reward rt = v(ct , xt ) + ηt
1
P 1{π(ct )=ct }
Set Vb` (π) = T` −T `−1 t∈(T`−1 ,T` ] rt pt
b ` | max 0 b Vb` (π 0 ) − Vb` (π) ≥ 2`

b `+1 ← Π
Π b` \ π ∈ Π
π ∈Π`
t←t+1
Output: Πt+1
We state without proof:
1
Lemma 7. For any finite policy set Π and γ ≤ 2n we have

1
min max EC ≤ 2n
Q∈4Π π∈Π Qγ (π(C)|C)

25
Lemma 8. For all ` = 1, 2, . . . we have π ? ∈ Π
b ` and max b V (π) ≥ V (π ? ) − 8` .
π∈Π`
h i
1
Proof. Let τ` = T` − T`−1 . Noting that the variance of Vb` (π) is bounded by maxπ∈Π
b ` EC Qγ
, we
` (π(C)|C)
apply Bernstein’s inequality at each stage ` to find
s
4n log(2|Π|T /δ) 2 log(2|Π|T /δ)
|V (π) − Vb` (π)| ≤ +
τ` 3γ` τ`
s
16n log(2|Π|T /δ)
≤
τ`
q
1 log(2|Π|T /δ)
for the choice of γ` = min{ 2n , 9nτ` } to equalize the terms for large τ` . The last inequality holds if
τ` ≥ n log(2|Π|T /δ). To make the right hand side less than ` , it suffices to take τ` = d16n−2
` log(2|Π|T /δ)e.
b ` with π ? ∈ Π
For any fixed Π b ` , we have that any π ∈ Π
b ` satisfies

Vb` (π) − Vb` (π ? ) = Vb` (π) − V (π) + V (π) − V (π ? ) +V (π ? ) − Vb` (π ? )

| {z }
≤0

≤ 2` .

On the other hand, for any π such that V (π ? ) − V (π) > 4`

max Vb` (π 0 ) − Vb` (π) ≥ Vb` (π ? ) − Vb` (π)

π 0 ∈Π
b`

= Vb` (π ? ) − V (π ? ) + V (π ? ) − V (π) +V (π) − Vb` (π)

| {z }
>4`

> 2`
? ?
which implies this π will be kicked out. This means that maxπ∈Π b `+1 V (π) ≥ V (π ) − 4` ≥ V (π ) − 8`+1 .

Extending the proof to all ` and random Π b ` is identical to above for linear bandits.

Suppose you run for T timesteps. Let ∆ = minπ6=π? V (π ? ) − V (π). Then for any ν ≥ 0 the regret is
bounded by:

dlog2 (4(∆∨ν)−1 )e
X
T ν+ (γ` n + 8` (1 − γ` n))τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X log(2|Π|T /δ)
= Tν + (n 9nτ` + 8` )τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X
= Tν + n d16n−2 −2
` log(2|Π|T /δ)e log(2|Π|T /δ)/9n + 8` d16n` log(2|Π|T /δ)e
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ Tν + 2n−1 −1
` log(4|Π|T /δ) + 128n` log(4|Π|T /δ)
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ T ν + 8 + 130n log(2|Π|T /δ) 2t
t=1
≤ T ν + 8 + 2080n(∆ ∨ ν)−1 log(2|Π|T /δ).

p using the upper bound (∆ ∨ ν) ≤ ν and optimizing over ν we have that the regret is no great
As before,
than O( nT log(|Π|T /δ)).

26
10.3 EXP4 for oblivious adversaries
The following algorithm and proof are very standard. However, the textbooks [Lattimore and Szepesvári, 2020,
Bubeck et al., 2012] have some unfortunate typos and/or notation that I found confusing so I have repro-
duced EXP4 here.

EXP4: Exponential Weights for Exploration Exploitation

Input: Time horizon T , n arms, m expers, η > 0, γ ∈ [0, 1], λ ∈ 4n .
Initialize: Player sets Q1 = (1/m, . . . , 1/m) ∈ [0, 1]1×m . Adversary chooses {`t }Tt=1 ⊂
[0, 1]n .
for: t = 1, · · · , T
Nature reveals expert advice E (t) ∈ [0, 1]m×n
(t) Pm (t)
Set pt,i = EM ∼Qt [EM,i ] = j=1 Qt,j Ej,i
(t)
Player draws Mt ∼ Qt and It ∼ EMt ∈ 4n (equivalent to It ∼ pt ∈ 4n )
Player suffers (and observes) loss `t,It
1{It =i}`t,i
Estimate arm losses `bt,i = p
Pt,in (t)
Estimate expert losses ybt,j = i=1 Ej,i `bt,i for all j = 1, . . . , m
Update iterates:
t
X m
X
Q
e t+1,i = exp(−η ybs,i ) Qt,i = Q
e t+1,i / Q
e t+1,j .
s=1 j=1

First we will prove some simple identities:

n n
(t) (t)
X X
yt,j ] = E[
E[b Ej,i `bt,i ] = Ej,i `t,i
i=1 i=1

and
n n X
m m n m n
(t) (t)
X X X X X X
E[`t,It ] = E[ pt,i `t,i ] = Qt,j Ej,i `t,i = Qt,j Ej,i `t,i = E[ Qt,j ybt,j ]
i=1 i=1 j=1 j=1 i=1 j=1 i=1

The expert regret is defined as

" #  
T n T X
m n n
(t) (t) (t)
X X X X X
max E `t,It − Ek,i `t,i = max E  Qt,j Ej,i `t,i − Ek,i `t,i 
k=1,...,m k=1,...,m
t=1 i=1 t=1 j=1 i=1 i=1
 
XT X
m
= max E  Qt,j ybt,j − ybt,k 
k∈[m]
t=1 j=1
" T #
X
= max E ybt,Mt − ybt,k
k∈[m]
t=1
 
T m
log(m) η X X
2 
≤ + E Qt,j ybt,j
η 2 t=1 j=1

log(m) ηnT
≤ +
η 2
where the first two lines follow from plugging in the identities of above, the third from the definition of Mt ,

27
and the first inequality follows from the guarantee of EXP3. The last inequality follows from
" n #
X (t)
2
yt,j
E[b ]=E ( Ej,i `bt,i )2
i=1
 !2 
n
(t) 1{It = i}`t,i
X
= E Ej,i 
i=1
pt,i

(t)
!2 
Ej,It `t,It
=E  
pt,It

n (t)
!2
X Ej,i `t,i
= pt,i
i=1
pt,i
n (t)
X Ej,i
≤
i=1
pt,i

so
 
m m n (t)
X
2 
X X Ej,i
E Qt,j ybt,j = Qt,j
j=1 j=1 i=1
pt,i
n Pm (t)
X j=1 Qt,j Ej,i
= =n
i=1
pt,i

28
References
[Audibert and Bubeck, 2009] Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic
bandits.
[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256.
[Bubeck et al., 2012] Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends R in Machine Learning, 5(1):1–122.
[Cappé et al., 2013] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback–leibler
upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541.
[Fiez et al., 2019] Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. (2019). Sequential experimental design for
transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676.
[Kaufmann et al., 2016] Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identifi-
cation in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
[Lattimore, 2018] Lattimore, T. (2018). Refining the confidence level for optimistic bandit strategies. The Journal
of Machine Learning Research, 19(1):765–796.
[Lattimore and Szepesvari, 2016] Lattimore, T. and Szepesvari, C. (2016). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491.
[Lattimore and Szepesvari, 2017] Lattimore, T. and Szepesvari, C. (2017). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737.
[Lattimore and Szepesvári, 2020] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. https: //
tor-lattimore. com/ downloads/ book/ book. pdf .
[Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the
multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
[Pukelsheim, 2006] Pukelsheim, F. (2006). Optimal design of experiments. SIAM.
[Soare, 2015] Soare, M. (2015). Sequential resource allocation in linear stochastic bandits. PhD thesis, Université
Lille 1-Sciences et Technologies.
[Soare et al., 2014] Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In
Advances in Neural Information Processing Systems, pages 828–836.
[Yu et al., 2006] Yu, K., Bi, J., and Tresp, V. (2006). Active learning via transductive experimental design. In
Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM.

Common questions

High probability bounds are effective in adversarial settings as they provide a statistical guarantee on performance under the worst-case loss sequence, ensuring that the regret remains low with high confidence. This probabilistic assurance caters to robust algorithm design, capable of handling uncertainties and adversarial influences more reliably .

In the full information setting, the exploration-exploitation trade-off can be managed directly using the complete loss vectors, allowing precise adjustments based on observed data. In bandit settings, where only partial feedback is obtained, the trade-off is encoded through probabilistic estimates and stochastic approaches like EXP3, reflecting an indirect method of balancing exploration and exploitation amidst uncertainty .

Mirror Descent iterates adjust the decision variables by projecting the gradients of the loss function onto the feasible decision space with respect to a chosen distance-generating function. This method helps exploit the curvature of the space and can more effectively converge to an optimal solution by controlling the learning rate dynamically, which is especially useful in high-dimensional task environments .

The least squares estimator allows the algorithm to approximate the unknown parameter B8* based on the observations collected in each round. By constructing this estimator only with the data from the current round, the algorithm efficiently updates its understanding of the reward structures, thus enhancing its decision-making process for selecting arms that maximize expected reward while minimizing regret .

Exponential weights play a critical role in managing situations where the feedback available is limited to only the outcome associated with the chosen action. By assigning weights to actions based on their historical performance, the algorithm can adjust probabilities for choosing each action in future rounds. This approach enables the estimation of the outcomes for all actions despite observing only one, thus facilitating a balance between exploration and exploitation .

The primary goal of the Action Elimination Algorithm is to identify the best arm out of a set of multiple arms with high confidence. The algorithm aims to ensure that the optimal arm remains in the set of candidates throughout the iterations by eliminating arms that perform significantly worse than the current best estimate. This elimination process relies on calculating empirical means and their confidence intervals to decide whether an arm can be eliminated from further consideration .

The assumptions made regarding the distribution of rewards include the use of sub-Gaussian distributions, ensuring that variability in observations is controlled and that reward distributions do not deviate significantly from their mean. This assumption simplifies the regret analysis and calculation of confidence bounds in best-arm identification settings .

Neglecting practicalities can lead to a simplified analysis; however, it may result in algorithms that perform slightly weaker in practice. By sacrificing some performance potential for analytical simplicity, the behavior of algorithms in realistic environments might not be accurately captured, which can affect their reliability when implemented .

When B3 is set to zero, the regret bound of the bandit algorithm is O(d 94^{-1} log(|X|T)), and it implies a regret bounded by c d 94 log(|X|T). On the other hand, minimizing over B3 > 0, the regret bound becomes O(1Ap d T log(log(T/d)|X|/B4)), which implies a regret bounded by c 1Ap dT log(|X|T).

The EXP3 algorithm uses Exponential Weights for updating the probability distribution. This approach ensures that exploration is handled through random sampling based on current probability estimates, while also allowing exploitation of known rewarding actions by iteratively adjusting action probabilities .

Stochastic Bandits and Regret Analysis
No ratings yet
Stochastic Bandits and Regret Analysis
36 pages
Stoichastic Bandits
No ratings yet
Stoichastic Bandits
6 pages
Lower Bounds on Regret in Bandits
No ratings yet
Lower Bounds on Regret in Bandits
16 pages
Epsilon Greedy in Multi-Armed Bandits
No ratings yet
Epsilon Greedy in Multi-Armed Bandits
14 pages
Sub-linear Regret in Multi-armed Bandits
No ratings yet
Sub-linear Regret in Multi-armed Bandits
9 pages
Multi-Armed Bandit Strategies Explained
No ratings yet
Multi-Armed Bandit Strategies Explained
34 pages
Adaptive Exploration in Bandit Algorithms
No ratings yet
Adaptive Exploration in Bandit Algorithms
8 pages
Multi-Armed Bandit Problem Explained
No ratings yet
Multi-Armed Bandit Problem Explained
6 pages
RL Notes
No ratings yet
RL Notes
49 pages
Stochastic Bandit Models in Online Learning
No ratings yet
Stochastic Bandit Models in Online Learning
61 pages
Finite-Time Analysis of The Multi-Armed Bandit Problem With Known Trend
No ratings yet
Finite-Time Analysis of The Multi-Armed Bandit Problem With Known Trend
7 pages
Stochastic Multi-Armed Bandits Explained
No ratings yet
Stochastic Multi-Armed Bandits Explained
15 pages
Adaptive Exploration in RL Algorithms
No ratings yet
Adaptive Exploration in RL Algorithms
4 pages
UCBRev 1
No ratings yet
UCBRev 1
11 pages
Minimizing Simple Regret in Infinite Bandits
No ratings yet
Minimizing Simple Regret in Infinite Bandits
9 pages
UCB and Thompson Sampling Algorithms
No ratings yet
UCB and Thompson Sampling Algorithms
7 pages
Lecture 2 Annotated
No ratings yet
Lecture 2 Annotated
42 pages
Exploration vs. Exploitation in RL
No ratings yet
Exploration vs. Exploitation in RL
56 pages
Minimax Optimal Linear Bandits Algorithm
No ratings yet
Minimax Optimal Linear Bandits Algorithm
20 pages
UCB1 Regret Analysis in Reinforcement Learning
No ratings yet
UCB1 Regret Analysis in Reinforcement Learning
9 pages
Contextual Bandits: Explore-Exploit Strategies
No ratings yet
Contextual Bandits: Explore-Exploit Strategies
16 pages
HW 9
No ratings yet
HW 9
4 pages
RL Unit
No ratings yet
RL Unit
595 pages
Thompson Sampling Regret in Bandits
No ratings yet
Thompson Sampling Regret in Bandits
49 pages
Understanding GAN Mathematics Intuitively
No ratings yet
Understanding GAN Mathematics Intuitively
8 pages
n-Armed Bandits in Reinforcement Learning
No ratings yet
n-Armed Bandits in Reinforcement Learning
4 pages
Kullback-Leibler UCB for Bandit Allocation
No ratings yet
Kullback-Leibler UCB for Bandit Allocation
59 pages
Subgaussian Distribution in MABs
No ratings yet
Subgaussian Distribution in MABs
51 pages
Adversarial Bandits in ML Decision-Making
No ratings yet
Adversarial Bandits in ML Decision-Making
12 pages
Bandit Algorithms and Statistical Robustness
No ratings yet
Bandit Algorithms and Statistical Robustness
16 pages
Reinforcement Learning Lecture Overview
No ratings yet
Reinforcement Learning Lecture Overview
51 pages
Infinite Multi-Armed Bandit Algorithms
No ratings yet
Infinite Multi-Armed Bandit Algorithms
8 pages
Optimal Multi-Objective Arm Identification
No ratings yet
Optimal Multi-Objective Arm Identification
72 pages
MLT Homework
No ratings yet
MLT Homework
5 pages
Sequential Learning of The Pareto Front For Multi-Objective Bandits
No ratings yet
Sequential Learning of The Pareto Front For Multi-Objective Bandits
19 pages
Generalization in Online Convex Algorithms
No ratings yet
Generalization in Online Convex Algorithms
8 pages
CS 189 Final Exam Cheat Sheet
No ratings yet
CS 189 Final Exam Cheat Sheet
2 pages
Non-Stochastic Best Arm Identification
No ratings yet
Non-Stochastic Best Arm Identification
13 pages
pdf24 Images Merged
No ratings yet
pdf24 Images Merged
12 pages
Typical Regret Behavior in Bandit Algorithms
No ratings yet
Typical Regret Behavior in Bandit Algorithms
32 pages
NIPS 2016 Coin Betting and Parameter Free Online Learning Paper
No ratings yet
NIPS 2016 Coin Betting and Parameter Free Online Learning Paper
9 pages
Chapter 02
No ratings yet
Chapter 02
9 pages
PAC Bounds for Multi-Armed Bandits
No ratings yet
PAC Bounds for Multi-Armed Bandits
10 pages
Online Prediction Algorithms Overview
No ratings yet
Online Prediction Algorithms Overview
8 pages
CS5180 HW2 Yueren Xing
No ratings yet
CS5180 HW2 Yueren Xing
7 pages
Regret Bounds for Thompson Sampling
No ratings yet
Regret Bounds for Thompson Sampling
3 pages
UCB Algorithm in Reinforcement Learning
No ratings yet
UCB Algorithm in Reinforcement Learning
6 pages
Multi-Armed Bandits in Reinforcement Learning
No ratings yet
Multi-Armed Bandits in Reinforcement Learning
8 pages
Multi-Arm-Bandit Problem
No ratings yet
Multi-Arm-Bandit Problem
11 pages
Understanding Logistic Regression Basics
No ratings yet
Understanding Logistic Regression Basics
6 pages
Clus-UCB: Optimal Clustered Bandit Algorithm
No ratings yet
Clus-UCB: Optimal Clustered Bandit Algorithm
17 pages
CS 747 Weekly Quiz Overview
No ratings yet
CS 747 Weekly Quiz Overview
5 pages
Multi-Armed Bandit Problem Explained
No ratings yet
Multi-Armed Bandit Problem Explained
9 pages
Optimizing Exploration in Bandits Algorithm
No ratings yet
Optimizing Exploration in Bandits Algorithm
19 pages
Module1 B
No ratings yet
Module1 B
25 pages
Batched Multi-Armed Bandits Analysis
No ratings yet
Batched Multi-Armed Bandits Analysis
11 pages
PR 3
No ratings yet
PR 3
10 pages
Introduction to Algorithm Design
No ratings yet
Introduction to Algorithm Design
19 pages
Quantum Computing: From Bits to Qubits
No ratings yet
Quantum Computing: From Bits to Qubits
33 pages
Path Analysis: Direct & Indirect Effects
No ratings yet
Path Analysis: Direct & Indirect Effects
85 pages
Cost Analysis of Christmas Hampers Using Matrices
No ratings yet
Cost Analysis of Christmas Hampers Using Matrices
7 pages
Overview of Information Theory Concepts
No ratings yet
Overview of Information Theory Concepts
26 pages
Traffic Counting with OpenCV & Python
No ratings yet
Traffic Counting with OpenCV & Python
11 pages
Proto-RL: Efficient Representation Learning
No ratings yet
Proto-RL: Efficient Representation Learning
21 pages
AI Half-Yearly Exam for Grade X
No ratings yet
AI Half-Yearly Exam for Grade X
7 pages
CGLAB 3804: Window-Viewport Clipping
No ratings yet
CGLAB 3804: Window-Viewport Clipping
9 pages
Round Robin CPU Scheduling Implementation
No ratings yet
Round Robin CPU Scheduling Implementation
16 pages
Testbank Operations and Supply Chain Management 2nd Edition David Alan Collier James R Evans Ebook Solutions
No ratings yet
Testbank Operations and Supply Chain Management 2nd Edition David Alan Collier James R Evans Ebook Solutions
240 pages
SVM Basics for Supervised Learning
No ratings yet
SVM Basics for Supervised Learning
29 pages
Amazon Customer Behavior Analysis
No ratings yet
Amazon Customer Behavior Analysis
3 pages
Dynamic Programming in Ada
No ratings yet
Dynamic Programming in Ada
11 pages
Knapsack Problem and Algorithms Explained
No ratings yet
Knapsack Problem and Algorithms Explained
39 pages
MATLAB Curve Fitting with Least Squares
No ratings yet
MATLAB Curve Fitting with Least Squares
27 pages
STAT 3008 Regression Assignment 2
No ratings yet
STAT 3008 Regression Assignment 2
2 pages
Queueing Theory Performance Metrics
No ratings yet
Queueing Theory Performance Metrics
5 pages
S-DES Implementation in Python
No ratings yet
S-DES Implementation in Python
7 pages
Car Sorting Method with Workers
No ratings yet
Car Sorting Method with Workers
2 pages
Urban Flood Modeling with Deep Learning
No ratings yet
Urban Flood Modeling with Deep Learning
13 pages
Inverse Functions and Characteristics Review
No ratings yet
Inverse Functions and Characteristics Review
2 pages
Text Processing Exam Questions COM3110
No ratings yet
Text Processing Exam Questions COM3110
5 pages
Applications of the Bellman Equation
No ratings yet
Applications of the Bellman Equation
3 pages
Data Structures and Algorithms Overview
No ratings yet
Data Structures and Algorithms Overview
50 pages
Understanding Supermodularity in Economics
No ratings yet
Understanding Supermodularity in Economics
3 pages
Floating Point Representation Overview
No ratings yet
Floating Point Representation Overview
25 pages
M.Sc. Physics Exam: Classical Mechanics & Relativity
No ratings yet
M.Sc. Physics Exam: Classical Mechanics & Relativity
2 pages
Linear Programming in Management Science
No ratings yet
Linear Programming in Management Science
2 pages
Revision Checklist for Engineering Topics
No ratings yet
Revision Checklist for Engineering Topics
4 pages

Understanding Multi-armed Bandits Concepts

Uploaded by

Understanding Multi-armed Bandits Concepts

Uploaded by

Some Notes on Multi-armed Bandits

Kevin Jamieson, University of Washington

1.1 Regret Minimization

The goal is to have R(T ) = o(T ) to achieve sub-linear regret.

Thus, we want to minimize the number of times we play sub-optimal arms.

for the chosen τ .

1.4 Disclaimers and goals of these lectures

2 Action Elimination Algorithm for Multi-armed Bandits

In what follows assume E holds.

θbj,` − θb1,` = (θbj,` − θj∗ ) − (θb1,` − θ1∗ ) − ∆`

max θbj,` − θbi,` ≥ θb1,` − θbi,`

which implies this maxj∈Xb`+1 θj∗ ≥ θ1∗ − 4` = θ1∗ − 8`+1 .

Theorem 2. Assume that maxi∈X ∆i ≤ 4. For any T ∈ N, with probability at least 1 − δ

3 Lower bounds for Multi-armed Bandits

3.1 Mean of a Gaussian

and (integrating only over support of pq)

and that KL(N (0, 1)|N (∆, 1)) = ∆2 /2.

3.3 Regret, minimax

3.4 Gap-dependent regret

3.5 Revisiting MAB with Optimism

4 Linear Bandits Intro

4.1 Least Squares

z > (θ̂ − θ∗ ) ∼ N (0, z > (X > X)−1 z).

4.1.1 Aside: Gaussian to sub-Gaussian

P (xT (θ̂ − θ∗ ) > µ) = P (wT η > µ)

||w||22 = xT (X T X)−1 X T X(X T X)−1 x = xT (X T X)−1 x.

Thus with probability at least 1 − δ,

• A-optimality: minimize fA (λ) = Tr(A−1 2

• E-optimality: minimize fE (λ) = maxu:kuk≤1 u> A−1 2

• D-optimality: maximize gD (λ) = log(|Aλ |) maximizes the entropy of distribution. Also, if Eλ = {x :

• G-optimality: minimize fG (λ) = maxx∈X x> A−1 ∗ 2

• Leverage scores if V -optimality

6 Linear Bandits: Regret Minimization

Ex,` (V) = {|hx, θb` − θ∗ i| ≤ ` }

hx − x? , θb` i = hx, θb` − θ∗ i − hx? , θb` − θ∗ i + hx − x? , θ∗ i

max hx0 − x, θb` i ≥ hx? − x, θb` i

= hx? , θb` − θ∗ i − hx, θb` − θ∗ i + hx? − x, θ∗ i

which implies maxx∈Xb`+1 hx, θ∗ i ≥ hx? , θ∗ i − 4` = hx? , θ∗ i − 8`+1 .

• Optimism won’t help here

7 Linear Bandits: Pure exploration

kx? − xk2(P τ λx xx> )−1 1

Once can prove a lower bound of log(1/2.4δ)ρ? .

Set ` = 2−` , τ` = 2−2 2

Proof. For any V ⊆ X and x ∈ V define

Ex,` (V) = {|hx − x? , θb` − θ∗ i| ≤ ` }

hx0 − x? , θb` i = hx0 − x? , θb` − θ∗ i + hx0 − x? , θ∗ i

max hx0 − x, θb` i ≥ hx? − x, θb` i

= hx? − x, θb` − θ∗ i + hx? − x, θ∗ i

τ ≤ cρ? log(∆−1 )[log(1/δ) + log(log(∆−1 )) + log(|X |)].

f (Xb` ) = min max kx − x0 k2(P λx xx> )−1

dlog2 (4∆−1 )e dlog2 (4∆−1 )e  

We now note that

λ∈4X x∈X (hx − x? , θ∗ i)2

where we have used the fact that maxx,x0 ∈St kx − x0 k2(P ? 2P

8 Linear bandits: regret minimization revisited

9 Adversarial Linear Bandits

at+1 = ∇F ∗ (∇F (at ) − η∇`(at , zt ))

where we have assumed the iterates exist.

Online Mirror Descent with Linear Losses

at+1 = ∇F ∗ (∇F (at ) − ηzt )

Stochastic Online Mirror Descent with Linear Losses

at+1 = ∇F ∗ (∇F (at ) − ηb

Proof. Applying Corollary 1 with zbt we have

9.2 Simplex with unnormalized negative entropy

[∇F (x)]i = log(xi )

9.2.2 Full information, finite action set

Exponential Weights over finite actions

9.2.3 Bandit feedback over finite action sets

EXP3: Exponential Weights for Exploration Exploitation

9.3 Other action sets

9.3.1 Unit ball action set

9.3.2 Finite action sets

EXP3(γ): Exponential Weights for Exploration Exploitation

where φ(x) = ex − 1 − x ≤ x2 for |x| ≤ 1.

• yt,i = x> bt,i = x>

First note that

which implies this maxj∈Xb`+1 θj∗ ≥ θ1∗ − 4` = θ1∗ − 8`+1 .

Ex,` (V) = {|hx, θb` − θ∗ i| ≤ ` }

which implies maxx∈Xb`+1 hx, θ∗ i ≥ hx? , θ∗ i − 4` = hx? , θ∗ i − 8`+1 .

Set ` = 2−` , τ` = 2−2 2

Ex,` (V) = {|hx − x? , θb` − θ∗ i| ≤ ` }

dlog2 (4∆−1 )e dlog2 (4∆−1 )e

• Player chooses xt ∈ X and observes yt = v(ct , xt ) + t