0% found this document useful (0 votes)
9 views29 pages

Understanding Multi-armed Bandits Concepts

This is notes from a class

Uploaded by

natedaeila
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views29 pages

Understanding Multi-armed Bandits Concepts

This is notes from a class

Uploaded by

natedaeila
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Some Notes on Multi-armed Bandits

Kevin Jamieson, University of Washington

These notes were written for myself to refer to while lecturing. They are not a replacement for
the course textbooks [Lattimore and Szepesvári, 2020, Bubeck et al., 2012] and may contain
errors! I have posted them by request.

1 Introduction
Machine learning, and in particular, supervised learning, is the study of making statistical inferences from
previously collected data. Multi-armed bandits is more about an interaction between an agent (algorithm)
and an environment where one simultaneously collects data and makes inferences in a closed-loop.
You have n “arms” or actions, representing distributions. “Pulling” an arm represents requesting a
sample from that arm.
At each time t = 1, 2, 3, . . .
• Algorithm chooses an action It ∈ {1, . . . , n}
• Observes a reward XIt ,t ∼ PIt where P1 , . . . , Pn are unknown distributions
That is, playing arm i and time s results in a reward Xi,s from the ith distribution. In these lectures, all
distributions will be Gaussian (or sub-Gaussian) with variance 1 unless otherwise specified. Example of
sub-Gaussian distribution is bounded distributions on [−1, 1] or Gaussian N (0, 1). Formally, a distribution
of X is 1-sub-Gaussian if E[exp(λX)] ≤ exp(λ2 /2).
We will find that the means of the distribution are the most pertinent parameters of these distributions.
Let θi∗ = EX∼Pi [X] be the mean of the ith distribution. Define ∆i = maxj=1,...,n θj∗ − θi∗ . We measure
performance of an algorithm in two ways: 1) how much total reward is accumulated, and 2) how many total
pulls are required to identify the best mean.

1.1 Regret Minimization


After T time steps, define the regret as
" T
#
X
RT = max θj∗ T − E XIt ,t
j=1,...,n
t=1

The goal is to have R(T ) = o(T ) to achieve sub-linear regret.


If at time T the ith arm has been played Ti times, then
" T #
X

RT = max θj T − E XIt ,t
j=1,...,n
t=1
T
" n #
X X
= max θj∗ T − E Xi,t 1{It = i}
j=1,...,n
t=1 i=1
n
" T #
X X
= max θj∗ T − θi∗ E 1{It = i}
j=1,...,n
i=1 t=1
Xn
= max θj∗ T − θi∗ E [Ti ]
j=1,...,n
i=1
n
X
= ∆i E [Ti ]
i=1

Thus, we want to minimize the number of times we play sub-optimal arms.

1
1.2 Best-arm identification
Given a δ ∈ (0, 1) identify the best arm with probability at least 1 − δ using as few total pulls as possible.
While related, these objectives are at odds with one another. Sometimes called the (, δ)-PAC setting,
but for simplicity we’ll take  = 0.

1.3 Warm-up
Suppose n = 2. How long would it take to decide one arm was better than another using sub-gaussian
bounds? Consider the trivial algorithm:
Input: 2 arms, time τ ∈ N.
Pull each arm i ∈ {1, 2} exactly τ times and compute empirical mean θbi .
For all t > 2τ play arm arg maxi θbi
Proposition 1. Fix , δ. If Z1 , Z2 , . . . are independent mean-zero σ 2 -sub-Gaussian
Pτ random variables so that
E[exp(λZt )] ≤ exp(λ2 σ 2 /2), then for τ = d2σ 2 −2 log(1/δ)e we have P( τ1 t=1 Zt ≤ ) ≥ 1 − δ.
Proof.
τ τ
!
1X X
P( Zt > ) = P(exp λ Zt > exp(λτ ))
τ t=1 t=1
" τ
!#
X
≤ e−λτ  E exp λ Zt
t=1
τ
Y
= e−λτ  E [exp (λZt )]
t=1
= e−λτ  exp λ2 σ 2 τ /2


= exp −λτ  + λ2 σ 2 τ /2


≤ exp −τ 2 /2σ 2 ≤ δ


for the chosen τ .



Set τ = d8∆−2 log(4/δ)e and let θbi = 1
τ s=1 Xi,s for i = 1, 2. Define the event
( r )
∗ 2 log(4/δ)
Ei := |θbi − θi | ≤ .
τ

Then P(E1c ∪ E2c ) ≤ P(E1c ) + P(E1c ) ≤ δ. Thus, if we pull each arm τ times then on E1 ∩ E2 we have
θb1 > θ1∗ − ∆/2
≥ θ2∗ + ∆/2
> θb2
so that we have determined the best-arm. And we can play it forever.
After any T total plays such that arm i has been played Ti times and T = T1 + T2 , the expected regret
is at most
" T #
X

θ1 T − E XIs ,s = θ1∗ T − E [(T1 θ1∗ + T2 θ2∗ )]
s=1
= E [T2 ∆]
= E [T2 ∆1{E1 ∩ E2 } + T2 ∆1{E1c ∪ E2c }]
≤ E [τ ∆1{E1 ∩ E2 } + T ∆1{E1c ∪ E2c }]
≤ 8∆−1 log(4/δ) + ∆T P(E1c ∪ E2c )
≤ 8∆−1 log(4/δ) + ∆T δ.

2
If we take δ = 1/T then the expected regret is less than ∆ + 8∆−1 log(4T ). On the other hand, the regret
can’t possibly be greater than ∆T , thus the total regret is bounded by
" T #
X

θ1 T − E XIs ,s = min{T ∆, ∆ + 8∆−1 log(4T )}
s=1
p
≤1+2 8T log(4T )
p
where the last step takes the worst case ∆ = 8 log(4T )/T .
Takeaway: For√ very small ∆ we lose almost nothing, for very large ∆ its easy to distinguish, its maximized
at around 1/ T . We’ll see this again.

1.4 Disclaimers and goals of these lectures


To goal of these lectures is to provide an overview of the kinds of problems multi-armed bandits can solve,
and basic algorithmic tropes. By the end of these lectures I hope you will be able to model a real-world
problem as an appropriate bandit problem and propose an algorithm that is a strong baseline.
As a consequence, I will not be stressing important practicalities of algorithms that muck up the analysis.
That is, I will pay for a slightly weaker performing algorithm in exchange for a dramatically simpler analysis.
This means that I will i) assume the horizon times T for regret analyses is known in advance, ii) I will take
liberties with constants and some log factors, making notes of the best known results when pertinent, iii)
will always assume there is a uniquely optimal arm for best-arm identification and aim to find it, not merely
an -good arm, iv) I will only study best-arm identification in the fixed confidence setting (exists a whole
literature on fixed budget setting), v) will assume sub-Gaussian distributions everywhere and nothing more.

2 Action Elimination Algorithm for Multi-armed Bandits


Input: n arms X = {1, . . . , n}, confidence level δ ∈ (0, 1).
Let Xb1 ← X , t ← 1
while |Xb` | > 1 do
t = 2−t
4`2 |X |
Pull each arm in Xb` exactly τ` = d2−2 ` log( δ )e times
Compute the empirical mean of these rewards θbi,` for all i ∈ Xb`

Xb`+1 ← Xb` \ i ∈ Xb` : maxj∈Xb` θbj,` − θbi,` > 2`
t←t+1
Output: Xbt+1

Lemma 1. Assume that maxi∈X ∆i ≤ 4. With probability at least 1−δ, we have 1 ∈ Xb` and maxi∈Xb` ∆i ≤ 8`
for all ` ∈ N.
Proof. For any ` ∈ N and i ∈ [n] define
n o
Ei,` = |θbi,` − θi∗ | ≤ `
T n T∞ q 2 /δ)
and E = i=1 `=1 Ei,` . Noting that ` = 2 log(4n`
τ` we have
∞ ∞
n [
! n X
[ X δ
P(E c ) = P c
Ei,` ≤ ≤ δ.
i=1 `=1 i=1 `=1
2n`2

In what follows assume E holds.


Fix any ` for which 1 ∈ Xb` (note 1 ∈ Xb1 ). Then for any j ∈ Xb` we have

θbj,` − θb1,` = (θbj,` − θj∗ ) − (θb1,` − θ1∗ ) − ∆`


E
≤ 2`

3
which implies 1 ∈ Xb`+1 . Thus, 1 ∈ Xb` for all `. On the other hand, any i for which ∆i = θ1∗ − θi∗ > 4` we
have

max θbj,` − θbi,` ≥ θb1,` − θbi,`


j∈X
b`

= (θbj,` − θj ) − (θbi,` − θi ) − ∆i
> 2` − 4` = 2`

which implies this maxj∈Xb`+1 θj∗ ≥ θ1∗ − 4` = θ1∗ − 8`+1 .

Theorem 1. Assume that maxi∈X ∆i ≤ 4. Then with probability at least 1 − δ, 1 is returned from the
algorithm at a time τ that satisfies
n
X
τ ≤c ∆−2 −2
i log(n log(∆i )/δ)
i=2

Proof. If ∆ = mini6=1 ∆i then Xb` = {1} for t ≥ dlog2 (8∆)e. Note that

dlog2 (8∆)e
X
Ti = τ` 1{i ∈ Xb` }
`=1
dlog2 (8∆)e
X
≤ τ` 1{∆i ≤ 8` }
`=1
dlog2 (8∆−1
i )e
X
= τ`
`=1
dlog2 (8∆−1
i )e
2
4` |X |
X
= d2−2
` log( δ )e
`=1
dlog2 (8∆−1
i )e
4 log22 (16∆−2
i )|X |
X
≤ d2 log( δ )e 4`
`=1
4 log22 (16∆−2
i )|X |
≤ c∆i−2 log( δ )

which implies
Thus, the total number of samples taken before Xb` = {1} is equal to
n n
4 log22 (16∆−2
i )|X |
X X
Ti ≤ c∆−2
i log( δ )
i=1 i=1
Pn
which implies that one can identify the best arm after no more than i=2 ∆−2 −2
i log(n log(∆i )/δ).

Theorem 2. Assume that maxi∈X ∆i ≤ 4. For any T ∈ N, with probability at least 1 − δ


n
−1
c(∆i ∨ ν)−1 log( log((∆i ∨ν) )|X |
X X
Ti ∆i ≤ inf νT + δ ).
ν≥0
i:∆i >0 i=1

Pn
∆−1
p
Moreover, if the algorithm is run with δ = 1/T then RT ≤ c i=2 i log(T ) and RT ≤ c nT log(T ).

4
Suppose you run for T timesteps. Then for any ν ≥ 0 the regret is bounded by:
n
X X
∆i Ti ≤ νT + 8` τ` |Xb` |
i=2 `:8` >ν
dlog2 (8(∆∨ν)−1 )e
X
≤ νT + 8` τ` |Xb` |
`=1
−1
n dlog2 (8(∆∨ν)
X X )e
≤ νT + 8` τ` 1{∆i ≤ 8` }
i=1 `=1
−1
n dlog2 (8(∆
X X i ∨ν) )e
= νT + 8` τ`
i=1 `=1
−1
n dlog2 (8(∆ i ∨ν) )e
2
4` |X |
X X
= νT + 8` d2−2
` log( δ )e
i=1 `=1
n dlog2 (8(∆i ∨ν)−1 )e
4 log22 (8(∆i ∨ν)−2 )|X |
X X
≤ νT + c log( δ ) 2`
i=1 `=1
n
−1
c(∆i ∨ ν)−1 log( log((∆i ∨ν) )|X |
X
≤ νT + δ )
i=1
Pn −1 −1
Setting ν = 0 yields a regret of i=2p ∆i log(n log(∆i )/δ). On the other hand, using ∆i ∨ ν ≥ ν and
minimizing over ν yields a regret of nT log(n log(T )/δ). The expected regret, of course, is then bounded
by
n
" n #
X X
∆i E[Ti ] = E ∆i Ti
i=2 i=2
n
X
≤ ∆−1 −1 c
i log(n log(∆i )/δ) + T P(E )
i=2
Pn
Setting δ = 1/T implies the regret is less than i=2 c∆−1
i log(T ).
Some remarks:
• This analysis doesn’t reuse samples from previous rounds, it is easy to make this change.
• Regret bound requires knowledge of T a priori.

3 Lower bounds for Multi-armed Bandits


Let us briefly pause to consider how far off from optimal we are, and then think about an algorithm that
could get us to optimality. How do we know we’re doing okay?

3.1 Mean of a Gaussian


Suppose
Pn I get n samples from a Gaussian
p distribution N (µ, 1). You compute the empirical mean µ b =
1
n i=1 X i . We know that |b
µ − µ| ≤ 2 log(2/δ)/n. How tight is this? If µ ∈ {0, ∆} then we just need
n = 8∆−2 log(2/δ)1
1 Using the SPRT, as δ → 0 one needs just an expected number of samples equal to 2∆−2 log(2/δ).

5
2 2
1 −(x−µ) /2σ
Let pµ (x) = 2π e be the Gaussian distribution with mean µ. Under H0 , Xi ∼ p0 and under
H1 , Xi ∼ p∆ . Let φ : Rn → {0, ∆}. Then the minimax probability of error is equal to
1
inf max{P0 (φ = 1), P1 (φ = 0)} ≥ inf (P0 (φ = 1), P1 (φ = 0))
φ φ 2
Z Z 
1
= inf 1{φ(x) = 1}p0 (x)dx + 1{φ(x) = 0}p1 (x)dx
φ 2 x∈Rn x∈Rn
Z
1
= min{p0 (x), p1 (x)}dx
2 x∈Rn
Z 2
1 p
≥ p0 (x)p1 (x)dx (Cauchy-Schwartz)
4 x∈Rn
 Z 
1
≥ exp − log( pp10 (x) )p
(x) 1 (x)dx (Jensen’s)
4 x∈Rn

where
Z 2 Z 2
p p
p0 (x)p1 (x)dx = min{p0 (x), p1 (x)} max{p0 (x), p1 (x)}dx
x∈Rn n
Z x∈R Z
≤ min{p0 (x), p1 (x)}dx max{p0 (x), p1 (x)}dx (Cauchy-Schwartz)
x∈Rn x∈Rn
Z
≤2 min{p0 (x), p1 (x)}dx
x∈Rn

and (integrating only over support of pq)


Z 2  Z 
p p
p0 (x)p1 (x)dx = exp 2 log( p0 (x) p1 (x)/p0 (x)dx)
x∈Rn x∈Rn
 Z 
p
≥ exp 2 p0 (x) log( p1 (x)/p0 (x))dx
n
 Zx∈R 
= exp − log( pp01 (x)
(x)
)p1 (x)dx
x∈Rn

Note that
Z n
! n
Y p1 (xi ) Y
KL(P1 |P0 ) = log p1 (xi )dx
x i=1
p0 (xi ) i=1
= nKL(p1 |p0 ) = n∆2 /2

and that KL(N (0, 1)|N (∆, 1)) = ∆2 /2.


We conclude that
1
exp −n∆2 /2

inf max{P0 (φ = 1), P1 (φ = 0)} ≥
φ 4
Thus, to determine whether or not n samples are from a Gaussian with mean 0 or ∆ with probability of
failure less than δ, one needs n ≥ 2∆−2 log(1/4δ).

3.2 Indentification
An algorithm for best-arm identification at time t is described by given a history (Is , Xs )s<t for each time t
is described by a
• selection rule It ∈ [n] is Ft−1 measurable where Ft = σ(I1 , X1 , I2 , X2 , . . . , It−1 , Xt−1 )
• stopping time τ is Ft measurable, and

6
• recommendation rule bi ∈ [n] invoked at time τ which is Fτ -measurable.
Definition 1. We say that an algorithm for best-arm identification is δ-PAC if for all θ∗ ∈ Rn we have
Pθ∗ (bi = arg maxi∈[n] θi∗ ) ≥ 1 − δ.
The following is due to [Kaufmann et al., 2016], a strengthening of the first time it appeared in [Mannor and Tsitsiklis, 2004
Theorem 3 (Best-arm identification lower bound). Any algorithm Pnthat is δ-PAC on {P : Pi = N (θi , 1), θ1 >
1
maxi6=1 θi , θ ∈ [0, 1]n } for δ < 0.15 satisfies Eθ∗ [τ ] ≥ 2 log( 2.4δ ) i=1 ∆−2
i .

Proof sketch: The original instance has Pi = N (θi∗ , 1). Pick some j ∈ [n] and define an alternative mean
(j) (j)
vector θ(j) ∈ [0, 1]n such that θi = θi∗ if i 6= j and θi = θ1 +  for j = i for some arbitrarily small number
(j)
. Note that under θ , arm j is the best arm.
Because the algorithm claims to be δ-PAC, it has to output arm 1 under θ∗ and arm j under θ(j) . But
these two bandit games only differ on arm j so to tell the difference between them its only natural to sample
arm j until one can figure out which instance is being played (i.e., is its mean θj or θ1 + ?) The discussion
above suggests that to make this distinction with probability at least 1 − δ, it is necessary to sample arm j
at least 2(θ1 − θj + )−2 log(1/4δ) times. Taking  to zero and noticing that j was arbitrary completes the
sketch.
This is not a proof, however, because the number of times the algorithm samples arm j is random whereas
in the above argument it was fixed. The proof of [Kaufmann et al., 2016] provides convenient tools to prove
general lower bounds for δ-PAC settings.

3.3 Regret, minimax


Theorem p4 (Minimax regret lower bound). For every T ≥ n there exists an instance P = N (θ∗ , I) such
that RT ≥ (n − 1)T /27.
Proof sketch: Let θ∗ = θ = (∆, 0, . . . , 0). For any algorithm, by the pigeon hole principle, there exists an
arm bi ∈ [n] such that E[Tbi ] ≤ T /n.
Define an alternative Gaussian instance with mean vector θ0 that is identical to θ other than θbi = 2∆.
p
If ∆ ≈ n/T then bi will not be given enough samples to distinguish between the two instances, which
means E[T1 ] will be about the same under both models. √
Under θ, if E[T1 ] ≤ T /2 then the regret incurred is at√least ∆T /2 ≈ nT . On the other hand, under θ0 ,
if E[T1 ] > T /2 then the regret again is at least ∆T /2 ≈ nT .
This is not a proof because again the number of times an arm is pulled is random, but as before, these
arguments can be made precise.

3.4 Gap-dependent regret


Lemma 2. Any strategy that satisfies E[Ti (t)] = o(ta ) for any arm i with ∆i > 0 and a ∈ (0, 1), we have
R̄T Pn 2
that limT →∞ inf log(T ) = i=2 ∆i .

Takeaway: This is what his field does: prove an initial upper, then lower, then chase it.

3.5 Revisiting MAB with Optimism


Why go beyond action elimination algorithms? Because they will never hit the asymptotic lower bound, for
one thing, since if we look at when theqsecond to last arm exits, the lowerbounds are the same.
α-UCB which is arg maxi θbi,T (t) + 2α log(t) as α → 1 achieves the lower bound.
i Ti (t)
b1 ≈ µ1 . Minimizing
Any sub-linear regret algorithm plays arm 1 an infinite number of times, so assume µ
the maximum upper bound. Thus, we expect the number of times the ith arm is pulled is 2∆−2 i log(T ),
which is optimal.
UCB1 in its most popular
√ form was developed by [Auer et al., 2002].
MOSS first achieved nT regret [Audibert and Bubeck, 2009].
KL-UCB is finite-time analysis with optimal constants for asymptotic regret [Cappé et al., 2013].

7
The recent work of [Lattimore, 2018] defined a UCB-based algorithm that achieves asymptotic optimal

constants, and finite regret bounds of i log(T )
P
∆ −1 and nT .
i

4 Linear Bandits Intro


Now suppose each arm i = 1, . . . , n has a feature vectors xi ∈ Rd . And more over, there exists some θ∗ ∈ Rd
such that a pull of arm It ∈ [n] results in a reward yt = hxIt , θ∗ i + ηt where ηt ∼ N (0, 1).
Applications: Drug-discovery, Spotify, Netflix, ads
In the previous setup, pulling arm i provided no information about arm j, but now suddenly it does.

4.1 Least Squares


Given a sequence of arm choices and observed rewards let {xt , yt , ηt }τt=1 we denote the stacked sequences of
each as X ∈ Rτ ×d , Y ∈ Rτ , and η ∈ Rτ respectively where Y = Xθ∗ + η. Using this information we can
derive a least-squares estimate of θ∗ given as follows
θ̂ = (X T X)−1 X T Y = (X T X)−1 X T (Xθ∗ + η) = θ∗ + (X T X)−1 X T η.
Fix any z ∈ Rd , then Thus
z > (θ̂ − θ∗ ) = z > (X > X)−1 X > η.

Note that η ∼ N (0, I). For any W ∼ N (µ, Σ) we have AW + b ∼ N (Aµ + b, AΣA> ). Thus

z > (θ̂ − θ∗ ) ∼ N (0, z > (X > X)−1 z).


so that
 q 
>
P z (θ̂ − θ∗ ) ≥ 2z > (X > X)−1 z log(1/δ) ≤ δ.

We will use the notation kzk2A = z > Az so that with probability at least 1 − δ
p
z > (θ̂ − θ∗ ) ≤ kzk(X > X)−1 2 log(1/δ)

4.1.1 Aside: Gaussian to sub-Gaussian


For an arbitrary constant µ,

P (xT (θ̂ − θ∗ ) > µ) = P (wT η > µ)


≤ exp(−λµ)E[exp(λwT η)], let λ > 0 Chernoff Bound
t
X
= exp(−λµ)E[exp(λ wi ηi )]
i=1
t
Y
= exp(−λµ) E[exp(λwi ηi )] independence of wi ηi
i=1
Yt
≤ exp(−λµ) exp(λ2 wi2 /2) sub-Gaussian assumption
i=1
λ2
= exp(−λµ) exp( ||w||22 )
2
µ2 µ
≤ exp(− ) λ=
2||w||22 ||w||22
µ2
= exp(− T T −1 ) = δ,
2x (X X) x

8
where in the final step we made use of the following equality

||w||22 = xT (X T X)−1 X T X(X T X)−1 x = xT (X T X)−1 x.

Thus with probability at least 1 − δ,


r
T 1
x (θ̂ − θ∗ ) ≤ 2xT (X > X)−1 x log( )
δ
p
=: kxk(X > X)−1 2 log(1/δ)

5 Experimental design
Note that if I take measurements (x1 , . . . , xn ) ∈ X and observe their corresponding observations yi =
hxi , θ∗ i + ηi where ηi ∈ 0, ∞, then E[(θb − θ)(θb − θ)> ] = σ 2 (X T X)−1 and also, θb − θ∗ ∼ N (0, σ 2 (X > X)−1 ).
We can visualize this as a confidence ellipsoid for each choice of X. And we can even think of optimizing
1 −x> Σ−1 x/2
the choice. Recall that the PDF of a Gaussian is φ(x) = (2π|Σ|) d/2 e . With entropy 12 log(2πe|Σ|).
When the number of selected points is large, its more convenient to think of sampling n points from a
distribution placed over X . Define
X
Aλ = λx xx>
x∈X

so that for every X ∈ Rτ ×d there exists some λ ∈ 4X such that X > X = >
P
x∈X dλx τ exx = Aλ . This Aλ
can then be used to shape the covariance θ:
b

• A-optimality: minimize fA (λ) = Tr(A−1 2


λ ) minimizes E[kθ − θk2 ]
b

• E-optimality: minimize fE (λ) = maxu:kuk≤1 u> A−1 2


λ u minimizes maxu:kuk≤1 E[(hu, θ − θi) ]
b

• D-optimality: maximize gD (λ) = log(|Aλ |) maximizes the entropy of distribution. Also, if Eλ = {x :


x> A−1
λ x ≤ d} then D-optimality is the minimum volume ellipsoid that contains X .

• G-optimality: minimize fG (λ) = maxx∈X x> A−1 ∗ 2


λ x minimizes maxx∈X E[(hx, θ − θ i) ]
b

Lemma 3 (Kiefer-Wolfowitz (1960)). For any X with d = dim(span(X )), there exists a λ∗ ∈ 4X that
• maxλ gD (λ) = gD (λ∗ )
• minλ fG (λ) = fG (λ∗ )
• fG (λ∗ ) = gD (λ∗ ) = d

• support(λ∗ ) = (d + 1)d/2
Proposition 2. If λ∗ is the G-optimal design for X then if we pull arm x ∈ X exactly dτ λ∗x e times for some
τ > 0 and compute the least squares estimator θ. b Then for each x ∈ X we have with probability at least 1 − δ
p
hx, θb − θ∗ i ≤ kxk(Px∈X dτ λ∗x exx> )−1 2 log(1/δ)
1 p
≤ √ kxk(Px∈X λ∗x xx> )−1 2 log(1/δ)
τ
r
2d log(1/δ)

τ

and we have taken at most τ + d(d+1) pulls. Thus, for any δ 0 ∈ (0, 1) we have P( x∈X {|hx, θb − θ∗ i| >
S
q 2
2d log(2|X |/δ 0 )
τ }) ≤ δ 0 .

9
Notes:
• The support size of (d + 1)d/2 is trivial application of Caratheodory’s theorem. Many algorithms to
find this efficiently.
• Note that one can find a λ∗ with a constant approximation with just support O(d).

• Leverage scores if V -optimality


• John’s ellipsoid is equivalent to G/D-optimality
[Pukelsheim, 2006, Yu et al., 2006]. [Yu et al., 2006, Soare et al., 2014, Soare, 2015, Lattimore and Szepesvari, 2017],

6 Linear Bandits: Regret Minimization


This section is inspired by [Lattimore and Szepesvári, 2020].
Input: Finite set X ⊂ Rd , confidence level δ ∈ (0, 1).
Let Xb1 ← X , ` ← 1
while |Xb` | > 1 do
Let λb` ∈ 4 b be a d(d+1) -sparse minimizer of f (λ) = max kxk2P
X` 2 ( x∈X
λx xx> )−1
x∈X `
b` c

` = 2−` , τ` = 2d−2 2
` log(4` |X |/δ)
Pull arm x ∈ X exactly dλ b`,x τ` e times and construct the least squares estimator θb` using only
the observations  of this round
Xb`+1 ← Xb` \ x ∈ Xb` : maxx0 ∈Xb` hx0 − x, θb` i > 2`
`←`+1
Output: Xb`
After T time steps, define the regret as
" T #
X
RT = hx? , θ∗ i − E hxt , θ∗ i
t=1
 
X
= E Tx ∆x 
x6=x?

where ∆x = hx? − x, θ∗ i.

Lemma 4. Assume that maxx∈X hx? − x, θ∗ i ≤ 4. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 8` for all ` ∈ N.
Proof. For any V ⊆ X and x ∈ V define

Ex,` (V) = {|hx, θb` − θ∗ i| ≤ ` }

where it is implicit that θb` is the G-optimal design constructed in the algorithm at stage ` with respect
to Xb` = V. Note that this is precisely the analogous events of multi-armed bandits. The key piece of the

10
analysis is that
   
∞ [
[ ∞
X [
c c
P {Ex,` (Xb` )} ≤ P {Ex,` (Xb` )}
`=1 x∈X
b` `=1 x∈X
b`

!
X X [
c
= P {Ex,` (V)}, Xb` =V
`=1 V⊆X x∈V
∞ X
!
X [
c
= P {Ex,` (V)} P(Xb` = V)
`=1 V⊆X x∈V
X∞ X
δ|V|
≤ 2`2 |X | P(X`
b = V) ≤ δ
`=1 V⊆X

T T∞
Thus, in what follows, assume E := x∈X `=1 {Ex,` (Xb` )} holds.
Fix any ` for which x? ∈ Xb` (note x? ∈ Xb1 ). Then for any x ∈ Xb` we have

hx − x? , θb` i = hx, θb` − θ∗ i − hx? , θb` − θ∗ i + hx − x? , θ∗ i


≤ 2`

which implies x? ∈ Xb`+1 . Thus, x? ∈ Xb` for all `. On the other hand, any x for which hx? − x, θ∗ i > 4` we
have

max hx0 − x, θb` i ≥ hx? − x, θb` i


x0 ∈ X
b`

= hx? , θb` − θ∗ i − hx, θb` − θ∗ i + hx? − x, θ∗ i


> 2`

which implies maxx∈Xb`+1 hx, θ∗ i ≥ hx? , θ∗ i − 4` = hx? , θ∗ i − 8`+1 .

For any ` ≥ dlog2 (8∆−1 )e we have that Xb` = {x? }. Suppose you run for T timesteps. Then for any ν ≥ 0
the regret is bounded by:
dlog2 (8(∆∨ν)−1 )e
X X
∆x Tx = T ν + 8` (|support(λ
b` )| + τ` )
x∈X \x? `=1

dlog2 (8(∆∨ν)−1 )e
X
= Tν + 8` ( (d+1)d
2 + 2d−2 2
` log(4` |X |/δ))
`=1
dlog2 (8(∆∨ν)−1 )e
X
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) −1
)e + 16d−1 2
` log(4` |X |/δ)
`=1
dlog2 (8(∆∨ν)−1 )e
X
−1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 16d log(4 log22 (16(∆ ∨ ν) )|X |/δ) 2`
`=1
−1 −1 −1
≤ T ν + 4(d + 1)ddlog2 (8(∆ ∨ ν) )e + 512d(∆ ∨ ν) log(4 log22 (16(∆ ∨ ν) )|X |/δ)

Setting ν = 0 yields a regret bound of O(d∆−1 log(|X | log(∆−1 )/δ)) which implies RT ≤ c ∆
d
log(|X |T ). Mini-
p p
mizing over ν > 0 yields a regret bound of O( dT log(log(T /d)|X |/δ)) which implies RT ≤ c dT log(|X |T ).

Remarks:
• Let X = {ei : i ∈ [d]}. Then for this action set, this bound is nearly minimax according to our lower
bounds!

11
Pd
• However, this is also concerning: we know that in the bandit setting the regret scales like i=2 ∆−1
i log(T )
but this scales d∆−1 log(T ), which is significantly worse. Can we achieve this?
• For pure-exploration, an analogous analysis shows that one can identify the best-arm in ∆d2 log(1/δ)
pulls. But this is exactly the same rate we would have gotten if we did G-optimal once in the beginning
and sample according to that!

• Optimism won’t help here

7 Linear Bandits: Pure exploration


This section is inspired by [Fiez et al., 2019].
Showing that x? is the best arm is equivalent to showing that hx? − x, θ∗ i > 0 for all x ∈ X \ x? . Given
a finite number of observations, we have an estimate θb and a confidence set for θ∗ .
b = hx? − x, θb − θ∗ i + hx? − x, θ∗ i
hx? − x, θi
= hx? − x, θb − θ∗ i + ∆x
p
Recalling above, we have for any vector z ∈ Rd that |hz, θb − θ∗ i| ≤ kzk(X > X)−1 2 log(1/δ) w.p. ≥ 1 − δ.
We need to show that this confidence set is completely inside the x? region. Where we need to decrease
uncertainty is in the directions x − x? , clearly, which is not the G-optimal design. The most realistic
optimization program

ρ? := inf τ
λ∈4X ,τ ∈N

kx? − xk2(P τ λx xx> )−1 1


x∈X
subject to max ≤
x∈X ∆2x 2
kx? − xk2(P λx xx> )−1
x∈X
= inf max
λ∈4X x∈X ∆2x

Once can prove a lower bound of log(1/2.4δ)ρ? .


Input: Finite set X ⊂ Rd , confidence level δ ∈ (0, 1).
Let Xb1 ← X , t ← 1
while |Xb` | > 1 do
Let λb` ∈ 4X be a d(d+1) -sparse minimizer of f (λ; Xb` ) where
2
f (V) = inf f (λ; V) = inf max
0
kx − x0 k2(P λx xx> )−1
λ∈X λ∈X x,x ∈V x∈X

Set ` = 2−` , τ` = 2−2 2


` f (X` ) log(4` |X |/δ)
b
Pull arm x ∈ X exactly dτ` λ`,x e times and construct θb`
b
Xb`+1 ← Xb` \ x ∈ Xb` : maxx0 ∈Xb` hx0 − x, θbt i > `


t←t+1
Output: Xbt+1

Lemma 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. With probability at least 1 − δ, we have x? ∈ Xb` and
maxx∈Xb` hx? − x, θ∗ i ≤ 4` for all ` ∈ N.

Proof. For any V ⊆ X and x ∈ V define

Ex,` (V) = {|hx − x? , θb` − θ∗ i| ≤ ` }

12
where it is implicit that θb` is the design constructed in the algorithm at stage ` with respect to Xb` = V.
Given Xb` , with probability at least 1 − 2`2δ|X |
p
|hx − x? , θb` − θ∗ i| ≤ kx − x? k(Px∈V dτ` λ`,x (V)exx> )−1 2 log(4`2 |X |/δ)
kx − x? k(Px∈V λ`,x (V)xx> )−1 p
≤ √ 2 log(4`2 |X |/δ)
τ`
v
u kx − x? k2(P
u
> −1 p
x∈V λ`,x (V)xx )
≤t −2 2
2 log(4`2 |X |/δ)
2` f (V) log(4` |X |/δ)
= `
T∞ T
By exactly the same sequence of steps as above, we have P( `=1 x∈Xb` {|hx − x? , θbt − θ∗ i| > t }) =
T T∞ 
0
P `=1 Ex,` (X` ) ≥ 1 − δ, so assume these events hold. Consequently, for any x ∈ X`
b b
x∈X

hx0 − x? , θb` i = hx0 − x? , θb` − θ∗ i + hx0 − x? , θ∗ i


≤ hx0 − x? , θb` − θ∗ i
≤ `

so that x? would survive to round ` + 1. And for any x ∈ Xb` such that hx? − x, θ∗ i > 2` we have

max hx0 − x, θb` i ≥ hx? − x, θb` i


x0 ∈X
b`

= hx? − x, θb` − θ∗ i + hx? − x, θ∗ i


> −` + 2`
= `

which implies this x would be kicked out. Note that this implies that maxx∈Xb`+1 hx? −x, θ∗ i ≤ 2` = 4`+1 .

Theorem 5. Assume that maxx∈X hx? − x, θ∗ i ≤ 2. Then with probability at least 1 − δ, x? is returned from
the algorithm at a time τ that satisfies

τ ≤ cρ? log(∆−1 )[log(1/δ) + log(log(∆−1 )) + log(|X |)].

Proof. Define S` = {x ∈ X : hx? − x, θ∗ i ≤ 4` }. Note that by assumption X = Xb1 = S1 . The above lemma
T∞
implies that with probability at least 1 − δ we have `=1 {Xb` ⊆ S` }. This implies that

f (Xb` ) = min max kx − x0 k2(P λx xx> )−1


λ∈4X x,x0 ∈X
b` x∈X

≤ min max
0
kx − x0 k2(P λx xx> )−1
λ∈4X x,x ∈S` x∈X

= f (S` )

13
For ` ≥ dlog2 (4∆−1 )e we have that S` = {x? }, thus, the sample complexity to identify x? is equal to

dlog2 (4∆−1 )e dlog2 (4∆−1 )e  


X X X (d+1)d
dτ` λ
b`,x e =
2 + τ`
`=1 x∈X `=1
dlog2 (4∆−1 )e  
X (d+1)d
= 2 + 2−2 b 2
` f (X` ) log(4` |X |/δ)
`=1
dlog2 (4∆−1 )e
(d+1)d
X
≤ 2 dlog2 (4∆−1 )e + 2−2 2
` f (S` ) log(4` |X |/δ)
`=1
dlog2 (4∆−1 )e
4 log22 (8∆−1 )|X |
(d+1)d
X
≤ 2 dlog2 (4∆−1 )e + 4 log( δ ) 22` f (S` ).
`=1

We now note that


kx − x? k2(P λx xx> )−1
ρ? = inf max x∈X

λ∈4X x∈X (hx − x? , θ∗ i)2


kx − x? k2(P λx xx> )−1
x∈X
= inf max max
λ∈4X `≤dlog2 (4∆−1 )e x∈S` (hx − x? , θ∗ i)2
dlog2 (4∆−1 )e kx − x? k2(P
1 X
x∈X λx xx> )−1
≥ inf max
dlog2 (4∆−1 )e λ∈4X x∈S` (hx − x? , θ∗ i)2
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` inf max kx − x? k2(P λx xx> )−1
16dlog2 (4∆−1 )e λ∈4X x∈S` x∈X
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` inf max kx − x0 k2(P λx xx> )−1
64dlog2 (4∆−1 )e λ∈4X x,x0 ∈S` x∈X
`=1
dlog2 (4∆−1 )e
1 X
≥ 22` f (S` )
64dlog2 (4∆−1 )e
`=1

where we have used the fact that maxx,x0 ∈St kx − x0 k2(P ? 2P


> −1 ≤ 4 maxx∈St kx − x k( > −1 by
x∈X λx xx ) x∈X λx xx )
the triangle inequality.

8 Linear bandits: regret minimization revisited


Okay, now that we know how to do optimal pure exploration, how do we turn this into an algorithm that is
optimal? PT
Let RT (X , θ) = Eθ [ t=1 ∆Xt ], ∆x = maxx0 ∈X hx0 − x, θi
The next theorem is from [Lattimore and Szepesvári, 2020].
Theorem 6. Fix any X ⊂ Rd that spans Rd and θ∗ ∈ Rd such that arg maxx∈X hx, θ∗ i is unique. Any policy
(X ,θ ∗ )
for which RT (X , θ∗ ) = o(T a ) for any a > 0 also satisfies lim inf T →∞ RTlog(T ?
) ≥ r where
X
r? := inf αx ∆x
α∈[0,∞)X
x∈X
kx? − xk2(P αx xx> )−1 1
x∈X
subject to max ≤
x∈X ∆2x 2

14
Note that
1X
ρ? := inf αx
α∈[0,∞)X 2
x∈X
kx? − xk2(P αx xx> )−1 1
x∈X
subject to max ≤
x∈X ∆2x 2
Notes
• There exists an asymptotic algorithm [Lattimore and Szepesvari, 2016], but no satisfying finite-time
algorithm as of yet.
• Information directed sampling may be near-optimal and very high performance.

9 Adversarial Linear Bandits


Protocal for Linear Bandits
Input: Time horizon T , action set A ⊂ Rd .
Initialize: Adversary chooses {zt }Tt=1 ⊂ Rd .
for: t = 1, · · · , T
Player chooses action At ∈ A
Player suffers (and observes) loss `(At , zt ) = A>
t zt

9.1 Preliminaries
This presentation of mirror descent follows [Bubeck et al., 2012, Ch. 5].
For any open convex set D ⊂ Rd and its closure denoted D̄, for any Legendre F on D̄ define F ∗ (x) :=
supy∈D̄ x> y − F (y).
Define DF (x, y) = F (x) − F (y) − (x − y)> ∇F (y).
Let the Mirror Descent iterations satisfy, a1 = arg mina∈A F (a) then

at+1 = ∇F ∗ (∇F (at ) − η∇`(at , zt ))


e (1)
at+1 = arg min DF (a, e
at+1 ) (2)
a∈A

where we have assumed the iterates exist.


Theorem 7 (Online Mirror Descent). Let A ⊂ Rd be a closed convex action set, ` a subdifferentiable loss,
and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ dom(∇F (D)) for all (a, z) ∈ A × Z
is satisfied. Then OMD satisfies
T T
X supa∈A F (a) − F (a1 ) 1 X
sup `(at , zt ) − `(a, zt ) ≤ + DF ∗ (∇F (at ) − η∇`(at , zt ), ∇F (at )) .
a∈A t=1 η η t=1

Online Mirror Descent with Linear Losses


Input: Time horizon T , convex action set A ⊂ Rd .
Initialize: Player sets a1 = arg mina∈A F (a). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player suffers (and observes) loss `(at , zt ) = a>
t zt
Player observes zt
Update mirror descent iterates:

at+1 = ∇F ∗ (∇F (at ) − ηzt )


e
at+1 = arg min DF (a, e
at+1 )
a∈A

15
Corollary 1 (Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a closed convex action set,
{zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
T T
X
> supa∈A F (a) − F (a1 ) 1 X
sup (at − a) zt ≤ + DF ∗ (∇F (at ) − ηzt , ∇F (at )) .
a∈A t=1 η η t=1

Stochastic Online Mirror Descent with Linear Losses


Input: Time horizon T , action set A ⊂ Rd .
Initialize: Player sets a1 = arg mina∈A F (a). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T P
Player chooses distribution Pt over A with at = E[At |Pt ] = a∈A aPt (a)
Player samples At from Pt and suffers (and observes) loss `(At , zt ) = A>
t zt
zt |Pt ] = zt
Player computes estimate zbt with E[b
Update mirror descent iterates:

at+1 = ∇F ∗ (∇F (at ) − ηb


e zt )
at+1 = arg min DF (a, e
at+1 )
a∈convhull(A)

Corollary 2 (Stochastic Online Mirror Descent with Linear Losses). Let A ⊂ D ⊂ Rd be a finite action set,
zt }Tt=1 ⊂ Z, `(a, z) = a> z, and F a Legendre function defined on A ⊂ D̄, such that ∇F (a) − ηz ∈ ∇F (D)
{b
for all (a, z) ∈ A × Z is satisfied. Then OMD satisfies
" T # T
X
> supa∈A F (a) − F (a1 ) 1 X
sup E (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] .
a∈A t=1
η η t=1

Proof. Applying Corollary 1 with zbt we have


T T
X supa∈A F (a) − F (a1 ) 1 X
sup (at − a)> zbt ≤ + DF ∗ (∇F (at ) − ηb
zt , ∇F (at )) .
a∈A t=1 η η t=1

Taking the expectation on both sides yields the result by noting that
" T # " T # " T # " T # " T #
X X X X X
> > > > >
E zt (At − a) = E zt (at − a) = E E[zt (at − a)|Pt ] = E zt (at − a)|Pt ] = E
E[b zbt (at − a) .
t=1 t=1 t=1 t=1 t=1

9.2 Simplex with unnormalized negative entropy


Pd Pn
Example 1 Let A = {x ∈ Rd : xi ≥ 0, i=1 = 1}, F (x) = i=1 xi log(xi ) − xi with D = (0, ∞)d . F is
Legendre and

[∇F (x)]i = log(xi )


d
X d
X
DF (x, y) = xi log( xyii ) − (xi − yi )
i=1 i=1
d
X
F ∗ (x) = exp(xi )
i=1
[∇F ∗ (x)]i = exp(xi )
d
X
D F∗ (x, y) = exp(yi )(exp(xi − yi ) − 1 − (xi − yi ))
i=1

16
9.2.1 Full information game
Assume the setting of Example 1. The following algorithm implements the updates of Mirror Descent above
for the particular loss `(a, z) = a> z.

Exponential Weights
Input: Time horizon T .
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player chooses at ∈ A
Player suffers (and observes) loss `(at , zt ) = a>
t zt
Player observes zt
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 3 (Exponential weights). Under the conditions of Example 1 with let `(a, z) = a> z, the expo-
nential weights algorithm satisfies
T
X log(d) ηT p
sup `(at , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2

Proof. Note ∇a `(a, z) = z. Plug in quantities of the example to obtain for any a ∈ A
T
X T
X
`(at , zt ) − `(a, zt ) = zt> (at − a)
t=1 t=1
T d
log(d) 1 X X
≤ + at,i (exp(−ηzt,i ) − 1 + ηzt,i )
η η t=1 i=1
T d
log(d) η X X 2
≤ + at,i zt,i
η 2 t=1 i=1
log(d) ηT
≤ +
η 2

where the second line uses F (x) ≤ 0 and F (a1 ) = log(d), the third line uses exp(−x) ≤ 1 − x + 21 x2 for
x ≥ 0, and the last line follows from zt,i ∈ [0, 1] and at is a probability distribution.

9.2.2 Full information, finite action set


Analogous to the categorial weather prediction problem in class, we now consider the case where the player
can only play from a distinct set {1, . . . , d} (i.e., predict rain, snow, sunny). As discussed in class, any
deterministic algorithm will suffer linear regret, so instead, at time t we choose a probability distribution
at ∈ A (in the setting of Example 1), choose distinct action It drawn according to at so that At := eIt , and
Pd
then suffer loss `(At , zt ) = A>
t zt = zt,It . Note that E[At ] = E[eIt ] = i=1 ei at,i = at so that E[`(At , zt )] =
E[A> >
t zt ] = at zt . Thus, the expected regret relative to any probability distribution a ∈ A over distinct items

17
in hindsight is
" T
# " T #
X X
E `(At , zt ) − `(a, zt ) = E zt> (At − a)
t=1 t=1
" T #
X
=E zt> (at − a)
t=1
" T
#
X
=E `(at , zt ) − `(a, zt )
t=1

where the expectation is with to the random selection of each It from at . Alternatively, we can directly
apply Corollary 2 with zbt = zt since we are in the full information setting.

Exponential Weights over finite actions


Input: Time horizon T .
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player chooses at ∈ A
Player draws It ∼ at , sets At = eIt and suffers (and observes) loss `(At , zt ) = A>
t zt =
zt,It
Player observes zt
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 4 (Exponential weights over finite actions). Under the conditions of Example 1 where the player
can only play ei for i ∈ {1, . . . , d} with let `(a, z) = a> z, the exponential weights over finite actions algorithm
satisfies
" T
#
X log(d) ηT p
E sup `(eIt , zt ) − `(a, zt ) ≤ + ≤ 2T log(d)
a∈A t=1 η 2

Proof. Immediate from reduction described above and the previous corollary since the iterates are identical
in the full information game. Due to the oblivious adversary we have
" T # " T
#
X X
sup E `(at , zt ) − `(a, zt ) = E sup `(at , zt ) − `(a, zt )
a∈A t=1 a∈A t=1

Note, as we did in class, one can also prove a high probability bound that would apply to a general reactive
adversary [?, Ch. 2.7]

9.2.3 Bandit feedback over finite action sets


This setting is identical to the previous setting, but now we do not observe the entire vector zt at each time
t, we only observe the element we played zt,It . Using this single value, the player constructs an unbiased
1{It =i}zt,i
estimate of zt with zbt,i = at,i for all i. Note that

1{It = i}zt,i
zt,i |a1 , I1 , . . . , at−1 , It−1 , at ] = E[
E[b |a1 , I1 , . . . , at−1 , It−1 , at ]
at,i
d
X 1{j = i}zt,i
= at,j
j=1
at,i
= zt,i .

18
Pd
Also note that E[At ] = E[eIt ] = i=1 ei at,i = at .

EXP3: Exponential Weights for Exploration Exploitation


Pd
Input: Time horizon T , A = {x ∈ Rd : xi ≥ 0, i=1 = 1}.
Initialize: Player sets a1 = (1/d, . . . , 1/d). Adversary chooses {zt }Tt=1 ⊂ [0, 1]d .
for: t = 1, · · · , T
Player draws It ∼ at and suffers (and observes) loss `(eIt , zt ) = zt,It
1{It =i}zt,i
Player sets zbt,i = at,i
Update mirror descent iterates:
t
X d
X
at+1,i = exp(−η
e zbs,i ) at,i = e
at+1,i / at+1,j .
e
s=1 j=1

Corollary 5 (EXP3). Under the conditions of Example 1 where the player can only play ei for i ∈ {1, . . . , d}
with `(a, z) = a> z and only observe bandit feedback, the EXP3 algorithm satisfies
" T # T
" d #
X log(d) η X X
2
sup E `(At , zt ) − `(a, zt ) ≤ + E at,i zbt,i
a∈A t=1
η 2 t=1 i=1
log(d) ηT d p
≤ + ≤ 2dT log(d)
η 2
Proof. We can directly apply Corollary 2:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb zt , ∇F (at ))]
a∈A t=1 η η t=1
T
" d #
log(d) 1 X X
= + E at,i (exp(−ηb zt,i ) − 1 + ηb
zt,i )
η η t=1 i=1
T
" d #
log(d) η X X
2
≤ + E at,i zbt,i
η 2 t=1 i=1
T
" d #
2
log(d) η X X 1{It = i}zt,i
= + E at,i
η 2 t=1 i=1
a2t,i
T d
log(d) η X X 2
= + z
η 2 t=1 i=1 t,i
log(d) ηdT
≤ +
η 2

For an alternative proof of EXP3, see [Lattimore and Szepesvári, 2020, Ch. 11]

9.3 Other action sets


The previous section addressed the case of the action set being equal to the simplex: A = {x ∈ Rd : xi ≥
Pd Pd
0, i=1 = 1}. As our Legendre potential we chose unnormalize negative entropy F (x) = i=1 xi log(xi )−xi .
Consider what the guarantee would be from Corollary 2 if we chose a different function, say, F (x) = 12 kxk22

19
then:
" T
# T
X
> supa∈A F (a) − F (a1 ) 1 X
E sup (At − a) zt ≤ + E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))]
a∈A t=1 η η t=1
T
1 ηX 
zt k22

≤ + E kb
η 2 t=1
T
" d #
1 ηX X
2
= + E zbt,i
η 2 t=1 i=1
T
" d #
X zt,i 2
1 ηX
= + E .
η 2 t=1 a
i=1 t,i

The issue here is that at,i can become arbitrarily close to 0 and blow up the bound. If we mix in unfirom
2/3
exploration
p at each round, one can show that the regret bound is O((dT ) ) which is significantly worse
than O( dT log(d)) of EXP3 above. So given an action set A how do we choose F ? The next proposition
sheds some light on this question.
Proposition 3. If F is twice continuously differntiable, and if its Hessian ∇2 F (x) is invertible ∀x ∈ D,
then ∀x, y ∈ D, there exists ζ ∈ D such that ∇F (ζ) ∈ [∇F (x), ∇F (y)] and
1
DF ∗ (∇F (x), ∇F (y)) = k∇F (x) − ∇F (y)k2(∇2 F (ζ))−1 .
2
The implication of the above proposition is that ∃∇F (ζt ) ∈ [∇F (at ) − ηb
zt , ∇F (at )] and

η2
DF ∗ (∇F (at ) − ηb
zt , ∇F (at )) = zt k2(∇2 F (ζt ))−1
kb
2
For the choice of F (x) = 12 kxk22 we have ∇2 F (ζt ) = I for any ζt so that the Hessian is flat across the action
Pd
set. On the other hand, with the choice F (x) = i=1 xi log(xi )−xi , we have ∇2 F (x) = diag(1/x1 , . . . , 1/xd )
which blows up as a component of x approaches 0. But this is perfect, since then
 2 
η 2
E [DF ∗ (∇F (at ) − ηb
zt , ∇F (at ))] = E kb
zt k(∇2 F (ζt ))−1
2
" d
#
η2 X 2
=E zb ζt,i
2 i=1 t,i
" d
#
2
η 2 X zt,i
=E ζt,i
2 i=1 at,i
d
η2 X 2
≈ z
2 i=1 t,i

where we have used the approximation that ζt ≈ at . The hessian of F is blowing up precisely at the locations
where zbt blows up, essentially cancelling each other! We’ll see another example of this in the next subsection.

9.3.1 Unit ball action set


Here we address the action set A = {x ∈ Rd : kxk2 ≤ 1}. To use Corollary 2, we need to define Pt (or
equivalently, At ) to make sure that E[At ] = at and we need to define zbt with E[b
zt ] = zt . Consider the
following choices:
• ξt ∼ Bernoulli(kat k2 ), It ∼ uniform([d]), t ∈ {−1, 1} with equal probability
• At = (1 − ξt )t eIt + ξt kaattk2

20
d
• zbt = (1 − ξt ) 1−ka t k2
At A>
t zt

It is straightforward to verify that E[Ath|at ] = at and E[b zti|at ] = zt . Following the intuition of the previous
η2 2
section, we need to choose F to make E 2 kb zt k(∇2 F (ζt ))−1 small. Let F (x) = − log(1 − kxk2 ) − kxk2 . Note
that
 
h
2
i d > 2
E kbzt k(∇2 F (ζt ))−1 = E k(1 − ξt ) At At zt k(∇2 F (ζt ))−1
1 − kat k2
d2
 
> 2
=E kAt At zt k(∇2 F (ζt ))−1 |ξt = 0
1 − kat k2
d
X 1 d2
= kei e> 2
i zt k(∇2 F (ζt ))−1
i=1
d 1 − kat k2
d
= kzt k2(∇2 F (ζt ))−1
1 − kat k2
d
≤ (1 − kζt k2 )kzt k22
1 − kat k2
≤ 2d
1−kzt k2
where we have used the fact that ∇2 F (x)  I/(1−kxk2 ) and 1−ka t k2
≤ 2 (see [Lattimore and Szepesvári, 2020]
for second fact). √
Using a shrunken action set, one can prove that the regret is bounded by O( dT ) (see [Lattimore and Szepesvári, 2020]).

9.3.2 Finite action sets


Here we study the case when |A| = n and A = {x1 , . . . , xn } ⊂ Rd and maxx∈A maxt |x> zt | ≤ 1. Our
approach will rely on a slight modification of the EXP3 algorithm:

EXP3(γ): Exponential Weights for Exploration Exploitation


Input: Time horizon T , n arms, η > 0, γ ∈ [0, 1], λ ∈ 4n .
Initialize: Player sets p1 = (1/n, . . . , 1/n) ∈ 4n . Adversary chooses {yt }Tt=1 ⊂ [−1, 1]n .
for: t = 1, · · · , T
Player draws It ∼ qt := (1 − γ)pt + γλ and suffers (and observes) loss `(It , yt ) = yt,It
Player computes ybt,i where E[byt,i |pt ] = yt,i
Update iterates:
t
X n
X
pet+1,i = exp(−η ybs,i ) pt,i = pet+1,i / pet+1,j .
s=1 j=1

A straightforward modification for the stochastic mirror descent proof accounts for the forced exploration:
Proposition 4 (EXP3(γ)). The regret of EXP3(γ) algorithm satisfies
" T # T
" n #
X log(n) 1X X
max E yt,It − yt,i ≤ + 2γT + E qt,i φ(−ηb
yt,i )
i∈[n]
t=1
η η t=1 i=1

where φ(x) = ex − 1 − x ≤ x2 for |x| ≤ 1.


We will use the EXP3(γ) algorithm as follows:
Pn
• Set At = xIt , Qt = i=1 qt,i xi x> bt = Q−1
i , z
>
t At At zt

• yt,i = x> bt,i = x>


i zt , y i z
bt
to obtain the following theorem:

21
Theorem 8. Let |A| = n and A = {x1 , . . . , xn } ⊂ Rd and maxx∈A maxt |x> zt | ≤ 1. Using the reduction to
EXP3(γ) with γ = ηd we have
" T #
X
> log(n) p
max E (At − a) zt ≤ + 3ηdT ≤ 3dT log(n)
a∈A
t=1
η

First note that

yt,i |pt ] = E[x>


E[b bt |pt ]
i z
−1
= E[x> >
i Qt At At zt |pt ]
−1
= x> >
i Qt E[At At |pt ]zt

= x>
i zt = yt,i

We also have
2
yt,i
E[b ] = E[(x> bt )2 |pt ]
i z
−1 > −1
= E[x> > 2
i Qt At At Qt xi (At zt ) |pt ]
−1
≤ x>
i Qt xi · h
2

so that
" n
# n
X X
−1
E 2
qt,i ybt,i ≤ h2 qt,i x>
i Qt xi
i=1 i=1
n
X
−1
≤ h2 Trace(qt,i xi x>
i Qt )
i=1
≤ h Trace(Qt Q−1
2
t ) = dL
2

Finally,

yt,i | = |ηx>
|ηb bt |
i z
−1
= |ηx> >
i Qt At At zt |
−1
≤ ηh|x>
i Qt At |
−1
= ηh|x>
i Qt xIt |
≤ ηhkxi kQ−1 kxIt kQ−1
t t

≤ ηhkxi k(γ Pni=1 λi xi x>


i )
−1 kxIt k(γ
Pn > −1
i=1 λi xi xi )

ηhd

γ
so it suffices to take γ = ηhd.

10 Contextual Bandits
This section is inspired by [Lattimore and Szepesvári, 2020]
For t = 1, 2, . . .
iid
• Nature reveals ct ∼ D

• Player chooses xt ∈ X and observes yt = v(ct , xt ) + t

22
which models users showing up to websites, or patients showing up to the doctor with different symptoms.
A policy π maps contexts to actions in [n] and the value of a policy π is defiend as
V (π) = EC, [v(C, π(C)) + ] = EC [v(C, π(C))]
At each time, we assume the action taken is according to some policy πt ∈ Π so that the regret is defined as
XT
RT = T · V (π ? ) − E[ V (πt )]
t=1

Notes
• This is not the only way we could have defined regret. Perhaps an even more natural approach would
be to define the optimal policy wrt the actual rewards, now the expected rewards.
One naive way to handle this is to run a different linear bandit for each ct , but this is impractical for
infinite sets of ct .

10.1 Stochastic Linear model


Suppose v(c, x) = hφ(c, x), θ∗ i. This can be learned through historical data (e.g., a deep network), or through
polynomials of inputs. We can restate the above models as For t = 1, 2, . . .
• Nature reveals (xt,1 , . . . , xt,n ) = Xt ⊂ Rd
• Player chooses It ∈ [n] and observes yt = hxt,It , θ∗ i + t
When we had a fixed action set, we built confidence intervals on hxi , θb − θ∗ i. Now that we don’t
know what action sets to expect, a natural to assume maxi,t kxi,t k ≤ 1 and build confidence intervals
on supu:kuk2 ≤1 hu, θb − θ∗ i = kθb − θ∗ k2 , or equivalently, define a set Ct with the guarantee that θ∗ ∈ Ct
for all t. When an action set Xt shows up, we could eliminate all provably sub-optimal arms by setting
Xbt = X \ {x : maxx0 ∈X hx0 − x, θi < 0 ∀θ ∈ Ct }. An alternative is to run UCB, defining:
U CBt (x) = maxhx, θi
θ∈Ct

and play xt = arg maxx∈Xt U CBt (x). If x?t = arg maxx∈Xt hx, θ∗ i then

hx?t , θ∗ i ≤ U CBt (x?t ) ≤ U CBt (xt ) = hxt , θi.


e

Thus, the instantaneous regret at time t satisfies


rt = hx?t − xt , θ∗ i
≤ hxt , θe − θ∗ i
≤ kxt kA−1 kθe − θ∗ kAt−1
t−1
p
≤ 2kxt kA−1 βt−1
t−1

Thus, the random regret satisfies


v v
T
X
u T
u X
u
u T
X
R
bT = rt ≤ T
t 2
rt ≈ t2T βT kxt k2A−1
t−1
t=1 t=1 t=1

Let θ̂t be the `2 -regularized least-squares estimate of θ∗ with regularization parameter λ > 0 given by
θ̂t = arg min kX1:t θ − Y1:t k + λkθk22 = (XT1:t X1:t + λI)−1 XT1:t Y1:t
θ

where we are denoting X1:t as a matrix with rows X1T , X2T , . . . , XtT and Y1:t as the vector (Y1 , . . . , Yt )T . The
following theorem says that with high probability θ∗ lies with high probability in an ellipsoid with center at
θ̂t .

23
Theorem 9. Confidence Ellipsoid. Assume the same as in Theorem ??, let V = Iλ, λ > 0, define Yt =
hXt , θt i + ηt and assume that kθ∗ k ≤ S. Then for any δ > 0, with probability at least 1 − δ, for all t ≥ 0, θ∗
lies in the set
 s 
 det(V ) 1/2 ) det(λI)−1/2 
t
Ct = θ ∈ Rd : kθ̂ − θkV t ≤ R 2 log( ) + λ1/2 S .
 δ 

Furthermore, if for all t ≥ 1, kXt k2 ≤ L then with probability at least 1 − δ, for all t ≥ 0, θ∗ lies in the set
( r )
0 d 1 + tL2 /λ 1/2
Ct = θ ∈ R : kθ̂ − θkV t ≤ R d log( )+λ S .
δ

10.2 Stochastic Contextual Bandits for General policy classes


This section is inspired by Alekh’s Agarwal’s excellent notes on contextual bandits. See [Link]
[Link]/courses/cse599m/19sp/.
Let’s return to the general setting of trying to minimize V (π ? ) − V (π). The core diffuclty of contextual
bandits is that if I take action i and receive a reward with mean v(ct , i), I don’t observe v(ct , j) for some
j 6= i. If I don’t see that context ever again, how can I know what I should have done?

10.2.1 τ -greedy
Here we explore a simple strategy that explores for τ steps then exploits or the last T − τ steps. For any
context c fix an exploration distribution µ(x|c) ∈ 4n such that µ(x|c) > 0 for all x, c.
Then suppose at time t, I observed some context ct and played xt ∼ µ(·|c) to receive reward rt =
v(ct , xt ) + t . Then define the inverse propensity scoring estimator as
1{xt = x}
vb(ct , x) = rt
µ(x|ct )
Note that
 
1{xt = x}
v (ct , x)|ct ] = E rt
E[b |ct
µ(x|ct )
 
X 1{xt = x}
= µ(x0 |ct )E rt |ct , xt = x0
0
µ(x|ct )
x ∈X
X 1{x0 = x}
= µ(x0 |ct )v(ct , x0 )
µ(x|ct )
x0 ∈X
= v(ct , x)
with variance
v (ct , x) − v(ct , x))2 |ct ] ≤ E[(b
E[(b v (ct , x))2 |ct ]
 
1{xt = x}
≤E |c t (rt ≤ 1)
µ(x|ct )2
X 1{x0 = x}
≤ µ(x0 |ct )
0
µ(x|ct )2
x ∈X
1
=
µ(x|ct )
1
Pt
and trivially, vb(c, x) ≤ vbmax := maxc,x µ(x|c) . Finally, if Vbt (π) = 1s s=1 vb(ct , π(ct )) then E[Vbt (π)] = V (π)
for any π ∈ Π. By Bernstein’s inequality we have with probability at least 1 − δ
s  
2 log(2/δ) 1 2b
vmax log(2/δ)
|V (π) − Vt (π)| ≤
b Ec,x∼µ +
t µ(π(c)|c) 3t

24
1
and in particular, if µ(x|c) = for all x, c then
n
r r
2n log(2/δ) 2n log(2/δ) 4n log(2/δ)
|V (π) − Vt (π)| ≤
b + ≤
t 3t t
q
for some t ≥ 2n log(2/δ). Define t := 4n log(2|Π|/δ)
t . If π
b = arg maxπ∈Π Vbt (π)

V (b π ) − Vbt (b
π ) = V (b π ) − Vbt (π ? ) + Vbt (π ? ) − V (π ? ) +V (π ? )
π ) + Vbt (b
| {z } | {z } | {z }
≥−t ≥0 ≥−t
?
≥ V (π ) − 2t

If we explore uniformly for τ rounds than exploit or T − τ rounds then we achieve a regret of at most
r
˙ − τ ) ≤ τ + T 4n log(2|Π|/δ)
1 · t + 2τ (T
τ
which is minimized at τ = (nT 2 log(2|Π|/δ))1/3 which yields a regret of O(T 2/3 (n log(2|Π|/δ))1/3 ) regret.
1
Pm
Lemma 6 (Bernstein’s inequality). Let X1 , . . . , Xm be independent random variables such that m i=1 E[(Xi −
E[Xi ])2 ] ≤ σ 2 and |Xi | ≤ B. Then
m
r
1 X 2σ 2 log(2/δ) 2B log(2/δ)
Xi ≤ +
m i=1 m 3m

with probability at least 1 − δ.

10.2.2 Action Elimination


We will make the strong assumption that the distribution of contexts is known a priori. Not so implausible
due to historical data.
Recall the value of a policy π ∈ Π as V (π) = Ec [v(c, π(c))]. Taking our G-optimality approach, we wish
to sequentially define probability vectors at each time and play probabilistically following

min max E[(Vbt (π) − V (π))2 ]


λ∈∆Π
b π∈Π
b

b ⊂ Π.
for some active set Π
Input: Policy set Π such that π : X → [n] for all π ∈ Π, confidence level δ ∈ (0, 1).
Let Πb 1 ← Π, ` ← 1
while |Π b ` | > 1 do
q
log(2|Π|T /δ)
` = 2−` , τ` = d16n−2 ` log(2|Π|T /δ)e, γ` = min{ 2n ,
1
9nτ`
}
h i
1
Q` = arg minQ∈4Πb maxπ∈Π b ` EC Qγ (π(C)|C)
`
γ P
s.t. Q (x|c) = γ + (1 − γn) π∈Π b ` :π(c)=x Q(π)
for t = T`−1 + 1, . . . , T`
Observe context ct
Play xt ∼ Qγ (·|ct ), set pt = Qγ (xt |ct ) and observe reward rt = v(ct , xt ) + ηt
1
P 1{π(ct )=ct }
Set Vb` (π) = T` −T `−1 t∈(T`−1 ,T` ] rt pt
b ` | max 0 b Vb` (π 0 ) − Vb` (π) ≥ 2`

b `+1 ← Π
Π b` \ π ∈ Π
π ∈Π`
t←t+1
Output: Πt+1
We state without proof:
1
Lemma 7. For any finite policy set Π and γ ≤ 2n we have
 
1
min max EC ≤ 2n
Q∈4Π π∈Π Qγ (π(C)|C)

25
Lemma 8. For all ` = 1, 2, . . . we have π ? ∈ Π
b ` and max b V (π) ≥ V (π ? ) − 8` .
π∈Π`
h i
1
Proof. Let τ` = T` − T`−1 . Noting that the variance of Vb` (π) is bounded by maxπ∈Π
b ` EC Qγ
, we
` (π(C)|C)
apply Bernstein’s inequality at each stage ` to find
s
4n log(2|Π|T /δ) 2 log(2|Π|T /δ)
|V (π) − Vb` (π)| ≤ +
τ` 3γ` τ`
s
16n log(2|Π|T /δ)

τ`
q
1 log(2|Π|T /δ)
for the choice of γ` = min{ 2n , 9nτ` } to equalize the terms for large τ` . The last inequality holds if
τ` ≥ n log(2|Π|T /δ). To make the right hand side less than ` , it suffices to take τ` = d16n−2
` log(2|Π|T /δ)e.
b ` with π ? ∈ Π
For any fixed Π b ` , we have that any π ∈ Π
b ` satisfies

Vb` (π) − Vb` (π ? ) = Vb` (π) − V (π) + V (π) − V (π ? ) +V (π ? ) − Vb` (π ? )


| {z }
≤0

≤ 2` .

On the other hand, for any π such that V (π ? ) − V (π) > 4`

max Vb` (π 0 ) − Vb` (π) ≥ Vb` (π ? ) − Vb` (π)


π 0 ∈Π
b`

= Vb` (π ? ) − V (π ? ) + V (π ? ) − V (π) +V (π) − Vb` (π)


| {z }
>4`

> 2`
? ?
which implies this π will be kicked out. This means that maxπ∈Π b `+1 V (π) ≥ V (π ) − 4` ≥ V (π ) − 8`+1 .

Extending the proof to all ` and random Π b ` is identical to above for linear bandits.

Suppose you run for T timesteps. Let ∆ = minπ6=π? V (π ? ) − V (π). Then for any ν ≥ 0 the regret is
bounded by:

dlog2 (4(∆∨ν)−1 )e
X
T ν+ (γ` n + 8` (1 − γ` n))τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X log(2|Π|T /δ)
= Tν + (n 9nτ` + 8` )τ`
`=1
dlog2 (4(∆∨ν)−1 )e q
X
= Tν + n d16n−2 −2
` log(2|Π|T /δ)e log(2|Π|T /δ)/9n + 8` d16n` log(2|Π|T /δ)e
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ Tν + 2n−1 −1
` log(4|Π|T /δ) + 128n` log(4|Π|T /δ)
`=1
dlog2 (4(∆∨ν)−1 )e
X
≤ T ν + 8 + 130n log(2|Π|T /δ) 2t
t=1
≤ T ν + 8 + 2080n(∆ ∨ ν)−1 log(2|Π|T /δ).

p using the upper bound (∆ ∨ ν) ≤ ν and optimizing over ν we have that the regret is no great
As before,
than O( nT log(|Π|T /δ)).

26
10.3 EXP4 for oblivious adversaries
The following algorithm and proof are very standard. However, the textbooks [Lattimore and Szepesvári, 2020,
Bubeck et al., 2012] have some unfortunate typos and/or notation that I found confusing so I have repro-
duced EXP4 here.

EXP4: Exponential Weights for Exploration Exploitation


Input: Time horizon T , n arms, m expers, η > 0, γ ∈ [0, 1], λ ∈ 4n .
Initialize: Player sets Q1 = (1/m, . . . , 1/m) ∈ [0, 1]1×m . Adversary chooses {`t }Tt=1 ⊂
[0, 1]n .
for: t = 1, · · · , T
Nature reveals expert advice E (t) ∈ [0, 1]m×n
(t) Pm (t)
Set pt,i = EM ∼Qt [EM,i ] = j=1 Qt,j Ej,i
(t)
Player draws Mt ∼ Qt and It ∼ EMt ∈ 4n (equivalent to It ∼ pt ∈ 4n )
Player suffers (and observes) loss `t,It
1{It =i}`t,i
Estimate arm losses `bt,i = p
Pt,in (t)
Estimate expert losses ybt,j = i=1 Ej,i `bt,i for all j = 1, . . . , m
Update iterates:
t
X m
X
Q
e t+1,i = exp(−η ybs,i ) Qt,i = Q
e t+1,i / Q
e t+1,j .
s=1 j=1

First we will prove some simple identities:


n n
(t) (t)
X X
yt,j ] = E[
E[b Ej,i `bt,i ] = Ej,i `t,i
i=1 i=1

and
n n X
m m n m n
(t) (t)
X X X X X X
E[`t,It ] = E[ pt,i `t,i ] = Qt,j Ej,i `t,i = Qt,j Ej,i `t,i = E[ Qt,j ybt,j ]
i=1 i=1 j=1 j=1 i=1 j=1 i=1

The expert regret is defined as


" #  
T n T X
m n n
(t) (t) (t)
X X X X X
max E `t,It − Ek,i `t,i = max E  Qt,j Ej,i `t,i − Ek,i `t,i 
k=1,...,m k=1,...,m
t=1 i=1 t=1 j=1 i=1 i=1
 
XT X
m
= max E  Qt,j ybt,j − ybt,k 
k∈[m]
t=1 j=1
" T #
X
= max E ybt,Mt − ybt,k
k∈[m]
t=1
 
T m
log(m) η X X
2 
≤ + E Qt,j ybt,j
η 2 t=1 j=1

log(m) ηnT
≤ +
η 2
where the first two lines follow from plugging in the identities of above, the third from the definition of Mt ,

27
and the first inequality follows from the guarantee of EXP3. The last inequality follows from
" n #
X (t)
2
yt,j
E[b ]=E ( Ej,i `bt,i )2
i=1
 !2 
n
(t) 1{It = i}`t,i
X
= E Ej,i 
i=1
pt,i

(t)
!2 
Ej,It `t,It
=E  
pt,It

n (t)
!2
X Ej,i `t,i
= pt,i
i=1
pt,i
n (t)
X Ej,i

i=1
pt,i

so
 
m m n (t)
X
2 
X X Ej,i
E Qt,j ybt,j = Qt,j
j=1 j=1 i=1
pt,i
n Pm (t)
X j=1 Qt,j Ej,i
= =n
i=1
pt,i

28
References
[Audibert and Bubeck, 2009] Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic
bandits.
[Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47(2-3):235–256.
[Bubeck et al., 2012] Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic
multi-armed bandit problems. Foundations and Trends R in Machine Learning, 5(1):1–122.
[Cappé et al., 2013] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., Stoltz, G., et al. (2013). Kullback–leibler
upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541.
[Fiez et al., 2019] Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. (2019). Sequential experimental design for
transductive linear bandits. In Advances in Neural Information Processing Systems, pages 10666–10676.
[Kaufmann et al., 2016] Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the complexity of best-arm identifi-
cation in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42.
[Lattimore, 2018] Lattimore, T. (2018). Refining the confidence level for optimistic bandit strategies. The Journal
of Machine Learning Research, 19(1):765–796.
[Lattimore and Szepesvari, 2016] Lattimore, T. and Szepesvari, C. (2016). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491.
[Lattimore and Szepesvari, 2017] Lattimore, T. and Szepesvari, C. (2017). The end of optimism? an asymptotic
analysis of finite-armed linear bandits. In Artificial Intelligence and Statistics, pages 728–737.
[Lattimore and Szepesvári, 2020] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. https: //
tor-lattimore. com/ downloads/ book/ book. pdf .
[Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the
multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
[Pukelsheim, 2006] Pukelsheim, F. (2006). Optimal design of experiments. SIAM.
[Soare, 2015] Soare, M. (2015). Sequential resource allocation in linear stochastic bandits. PhD thesis, Université
Lille 1-Sciences et Technologies.
[Soare et al., 2014] Soare, M., Lazaric, A., and Munos, R. (2014). Best-arm identification in linear bandits. In
Advances in Neural Information Processing Systems, pages 828–836.
[Yu et al., 2006] Yu, K., Bi, J., and Tresp, V. (2006). Active learning via transductive experimental design. In
Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM.

Common questions

Powered by AI

High probability bounds are effective in adversarial settings as they provide a statistical guarantee on performance under the worst-case loss sequence, ensuring that the regret remains low with high confidence. This probabilistic assurance caters to robust algorithm design, capable of handling uncertainties and adversarial influences more reliably .

In the full information setting, the exploration-exploitation trade-off can be managed directly using the complete loss vectors, allowing precise adjustments based on observed data. In bandit settings, where only partial feedback is obtained, the trade-off is encoded through probabilistic estimates and stochastic approaches like EXP3, reflecting an indirect method of balancing exploration and exploitation amidst uncertainty .

Mirror Descent iterates adjust the decision variables by projecting the gradients of the loss function onto the feasible decision space with respect to a chosen distance-generating function. This method helps exploit the curvature of the space and can more effectively converge to an optimal solution by controlling the learning rate dynamically, which is especially useful in high-dimensional task environments .

The least squares estimator allows the algorithm to approximate the unknown parameter B8* based on the observations collected in each round. By constructing this estimator only with the data from the current round, the algorithm efficiently updates its understanding of the reward structures, thus enhancing its decision-making process for selecting arms that maximize expected reward while minimizing regret .

Exponential weights play a critical role in managing situations where the feedback available is limited to only the outcome associated with the chosen action. By assigning weights to actions based on their historical performance, the algorithm can adjust probabilities for choosing each action in future rounds. This approach enables the estimation of the outcomes for all actions despite observing only one, thus facilitating a balance between exploration and exploitation .

The primary goal of the Action Elimination Algorithm is to identify the best arm out of a set of multiple arms with high confidence. The algorithm aims to ensure that the optimal arm remains in the set of candidates throughout the iterations by eliminating arms that perform significantly worse than the current best estimate. This elimination process relies on calculating empirical means and their confidence intervals to decide whether an arm can be eliminated from further consideration .

The assumptions made regarding the distribution of rewards include the use of sub-Gaussian distributions, ensuring that variability in observations is controlled and that reward distributions do not deviate significantly from their mean. This assumption simplifies the regret analysis and calculation of confidence bounds in best-arm identification settings .

Neglecting practicalities can lead to a simplified analysis; however, it may result in algorithms that perform slightly weaker in practice. By sacrificing some performance potential for analytical simplicity, the behavior of algorithms in realistic environments might not be accurately captured, which can affect their reliability when implemented .

When B3 is set to zero, the regret bound of the bandit algorithm is O(d 94^{-1} log(|X|T)), and it implies a regret bounded by c d 94 log(|X|T). On the other hand, minimizing over B3 > 0, the regret bound becomes O(1Ap d T log(log(T/d)|X|/B4)), which implies a regret bounded by c 1Ap dT log(|X|T).

The EXP3 algorithm uses Exponential Weights for updating the probability distribution. This approach ensures that exploration is handled through random sampling based on current probability estimates, while also allowing exploitation of known rewarding actions by iteratively adjusting action probabilities .

You might also like