0 ratings 0% found this document useful (0 votes) 9 views 54 pages Reinforcement Learning
The document discusses Markov Decision Processes (MDPs) and the exploration vs. exploitation dilemma in decision-making, highlighting the importance of balancing information gathering with optimizing immediate rewards. It introduces the Multi-Armed Bandit (MAB) problem and various strategies for maximizing cumulative rewards, including epsilon-greedy, Upper Confidence Bound (UCB), and Thompson Sampling. Additionally, it covers reinforcement learning concepts, including policy-based and value-based methods, and notable breakthroughs in deep reinforcement learning applications.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content,
claim it here .
Available Formats
Download as PDF or read online on Scribd
Go to previous items Go to next items
Save Reinforcement Learning For Later
° e e
Markov De Cismemsmsmmsms (IVIDP)
An MDP is defined by 5-tuple : (S,A,R, P,y)
* Aset of states S
+ Agent's knowledge ofthe environment
* Aset of actions A
* Set of transition Probabilities P
Py(s,8") = Pri Siva = 5°15, = 6 00 = a) probability thot
bind from Sis at tne wil ede dts tine
* Set of rewards R
+ R(@,s): reward received from state softer taking action a
Select action to maximize E(Ro +
y-Ry+y?-Rp+-)
* y € (0,1] is the discount factor.
+ _Represents the importance of future rewards in comparison to
mearate rewardDiscount Factor
* Weare uncertain about future (e.g. predicting stock price after 2
years)
* Prefer to collect more rewards sooner (Be happy now than 5 years
later)
+ Mathematical convenienceExploration vs Exploitation Dilemma
* Online decision-making involves a fundamental choice
+ Exploration: Gather more information
+ Procure a new item
+ Exploitation: Make the best decision given current information
Procure the most profitable item
* The best long-term strategy might require short-term sacrifices
* Gather enough information to make the best overall decisions°e e be
Exploration vs Exploitation Dilemma
+ Restaurant Selection
+ Exploitation: Go to your favorite restaurant,
+ Exploration: Try a new restaurant
* Music
* Exploitation: Listen to your favorite song
* Exploration: Try a different song
* Display advertisement
* Exploitation: Show the most successful ad
+ Exploration: Show a new adMulti-armed bandits (MAB)
eh wp “Fi #° e be
The Multi-Arni@a Bandit (MAB) Problem
TS “ne ‘A
A multi-armed bandit is a tuple (A, R)
* Ais a known and finite set of actions (or
‘arms') 4,2," dy}
* Ra(r) = p(r | a) isan unknown probability
distribution over rewards
At each step t = 1, 2,
Agent selects action A; € A
Environment selects reward R; ~ Ra,
Goal: Maximize expected cumulative reward E
lo 100=k
Multi-armed Bandits and RL°e e cs
Multi-Armed Bancre=Reereeas
* Goal is to collect as much reward as possible.
+ Sum of rewards is not very informative - doesn’t tell us how far we are from the
optimum reward.
* Mean reward for action A,
Va = ie 1A] Total regret
T*Vy — DV
* Optimal arm [ ot.
At = argmax V4
"A
‘Maximum achievable Reward collected by
reword a bandit olgorithm
Minimize total regret = Maximize cumulative reward° e ts
Application : Advts
Coe maces en
on the Beach - Lowest price quaranee - booking com
¥ Of Hotel Bookings - [Link]
~[Link]
Beach Palace Resert Al Inclusive, Cancun - Hotel Bockna°e e e
Multi-armed banessmamzsmommGreedy
-
Random arm with probability ¢
7
action = A 1 ©
Best arm with probability 1~ ¢ wf iN
66-8
candéate for exploration
Production friendly: instantly convert a production
algorithm to bandit!
How to set the value of ¢? Standard Practice ind reduce its value over timeUpper Confidence Bound (UCB)
Initialization
+ Pull each arm at least once and observe the reward
Fou iteration t:
Let Na(t) > 1 be the number of times arm a has been pulled
Let Vq,z be the mean rewards collected from arm a
Pullthe arm a* = argmax (Vax + FEES) and collect the reward
Update Na-(t) and VaseUpper Con fitemmssamsxcreel (UCB)
Three arms: ay, 43,43
‘Arm a, (blue arm) has the
. highest UCB
Pe)
After pulling a; (blue arm)
+ Weare more certain about its value° e -
Th0 Nemec | 8
Bayesian approach to Multi-armed bandits
* Model for predicting rewards: p( Ra | 9q),Qq is unknown
* Posits prior distribution on parameters : p(6,)
* Compute posterior distribution of rewards p(8q | De)
* Dy = 4,1, 42,72°** Ap, % is the sequence of arms and rewards collected in the first ¢ trials
* Use the posterior distribution to guide exploration
* Improved performance if distributional assumptions are correct
lo soo° e oe
Thompson Samp reg
* Apply Bayes’ theorem to compute posterior distribution p(@q | Dt)
+ Sample a reward distribution 8, from posterior
+ Compute value of arm a, Vz = E[Ra]
* Select action to maximize value on sample, a; = argmax Va
a° e x
Multi-Armed Baneres==rerere”
Recommendations for You,e e
Multi-Armed Barl@te"Exariple
Modeling product recommendation as MAB
* Each product is an arm
+ Rewards are binary
* Purchase (labeled as 1) / no-purchase (labeled as 0)°e e %
Multi-Armed Bandits: Example
Use Beta distribution as prior°e e
Multi-Armed Bandits: Example
* Posterior distribution of rewards
P(A | De) © p(De | )=P(Ox) Bors’ theorem
likelihood prior
[Ne : Number of impressions of product x in the first ¢ time steps
Pz :Number of times x has been purchased in the first ¢ time steps
PC De | Ox) & 88 CL — 0) Nae PotMulti-Armed Bandit: Example
Initialization
+ Initialize counters Nz = Pr = 0, product x
For iteration t > 1:
+ Forall x, sample 0,,¢ ~ Beta(P, +a, Nx — Py +B)
+ Recommend the product x* = argmax (8,,.) and collect the reward
+ Update Ny and PyePersonalized Recommendation
+ One bancit problem per user!
+ Hard to train - need several rounds of experience
with same user
* Make use of context information
‘+ User demographic
+ Purchase and browsing history
* Context: day, time, location ete.
* Products share some features too!
+ Price, discount X= (Xu%Xa)
+ Genre Bau
eos fetes fetes°e e is
Contextual Bandit ===
A contextual bandit is a tuple
(S,A, R)
P[s] is an unknown distribution over states (or
“‘contexts’)
+ Aisa known set of actions (or ‘arms’)
* Rea(r) = P[r | s,a] isan unknown probability
distribution over rewards
At each step t
+ Environment generates context/state s- ~ S
+ Agent selects action a € A
+ Environment offers reward 7 ~ Repay
Goal: Maximize cumulative reward Y7_Epsilon-Greedy
* Good news: e-Greedy works without any change!
Select arma
that maximizes
FSB)
FO: your favorite
reese)Upper Confidenc@' sear"
* Learn @ through Ordinary Least Squares (OLS).
+ Training data= (Se, 42),Teeenzr
Mean reward from arma, G(s,a) = $(s,a)’ 6°
O° isthe OLS estimate
How do we transform it into a bandit algorithm?
* Which arm to explore?
Explore arms that we are least confident about.
* Capture confidence through variance of G(s,a)
Pull the arm a” = argmax (a0 t+c* |Var(qs, ®))Thompson Sampling?
Receive context x,
Draw 6¢ according to P(@|D)
Select a; = argmax, E,.(r|x1, a, 6°)
Observe reward r;
D=DU (21, a1,7%)
end for°e
Thompson Sampling?
D=0
Posterior inference can be
Receive context x;
Draw 0 according tXE(@]D) > intractable
Select a; = arg max, purchase/no purchase
+ P(r |5,,8 )isoften modeled as logistic
Observe reward 7 mers
D=DU (2, a1,%) + pOO1D,) doesthave any clsedform
end for
Use Gibbs sampling°e e os
Contextual Bandigem=aummerey
Epsilon-greedy
+ Easy to implement
+ Gives explicit control on what to explore
+ Tuning ¢ is more of an art
UCB
+ Computationally expensive : 0(k®) computation, k: feature dimension
+ May suffer from delayed feedback
Thompson Sampling
Drawing samples is computationally expensive but using them in decision making is less
expensive than UCB : O(c) computation
More resilient to delayed feedback
No need to tune exploration budget unlike epsilon-greedyReinforcement LearningReinforcement Learning
1. RL Concepts
2. Approaches to RL
3. Deep RLReinforcement Learning
State, Reward, Action,
Stimulus, Gain, Payoff, Response,
Situation Cost Control
(world)Policy
* Deterministic Policy
* Mapping of states to actions: a = (s)
* Agent takes an action: A, = m(S;)
* Stochastic Policy
+ w(a| s) isa probability distribution
+ Ar ~m(a|S,=s)
* Agent’s goal is to choose each action so as to maximize the
discounted sum of future rewards
* Resa + Resa + ¥7Re43t --- (y: discount factor)
+ Biko Revise
»Action-Value Function
Expected long term reward:
= * From state s
* With action a
* Under policy 1
Revs Rees * With discount factor y
an(S,4) = ElResa + YReaz + Y?Resat ---|S¢ = 5,Ae = @Arsrco~ M1]
rAction-Value Function
PSIs.a) —x(a'ts")
‘ Expected long term reward:
At * From state s
® * With action a
. wN * Under policy
Rea Reva * With discount factor
Qu (S,a) = E[Revi + VRev2 +? Reagt «++ | Se = 8, Ap = @ Actr:co~ T]
=2 p(s'ls.a) fr + y Dala'ts'Yae(s'.a')]Optimal Value Functions
* An optimal value function is the maximum achievable value
q.(s,a) = maxq,(s,a) for alls €S,a€ A(s)
©
* Once we have q. we can act optimally to derive optimal policy
m*(s) = argmaxq. (s,a)
aBellman Optimality Equation
Bellman Expectation Equation mets.)
an(s,a) = p(s'|s,a) [r +7 Laa'ls') qn(s',a’)] 6..o ~
(a's!)
qu(s,a) =% pG'ls,a) fr +y mbox q.(s',a'))Value-based RL
* Tabular Methods: In tasks with small, finite state sets value functions
can be approximated using tables (entry for each state/state-action
pair).
* Function Approximation: Take examples from a desired function (e.g.,
value function) and generalize frommthem to construct approximation
of the entire function.Q-Learning (Watkins & Dayan, 92]
+ A form of model-free method
* Agent doesn’t need to have model of the environment
* Iteratively approximates the optimal action-value function, q.
* Estimates are updated as agent gathers more experience
* Off-policy learning method
* Learns the value of its determin igedy (target) policy from experience
collected using exploration (behavior) policy
* Target policy is the policy being learned about
* Behavior policy is the policy generating the trajectory data
* Convergence: If all pairs (s, a) are continuously updated, q converges to
qG Learning
Experience: Sample sequences of states, actions and rewards from on-line or
simulated interaction with an environment!
Learned action-value function Q directly approximates the optimal action-value
function q.
Actual Gy bes Pe nG) amiuany toGembar Oo »
+ Repeat fogeach episode:
+ Initialize $
+ Repeat (for each step of the episode)
+ Choose A from S using policy derived from Q (eg., e- greedy)
+ Take action A, observe R, S’
+ Q(S,A) = a Q(S,A) + (1 a) (R + ymax Q(s',a))
easy
* Until S is terminalQ-Learning Update
+ What we want to compute (Bellman Optimality Eq.):
Q(S, A) = Ey [r + y max Q(s’,a’)]
2
* Using the sampling approach, we take a sample of the quantity under
expectation
r+ymax Q(s',a’)
"
* Consider a decayed approach where we combine old value with target
Q(S,A):= @ Q(S, A) + (1 — a) (7 + y max Q(S',a))
Q(S,A):= Q(S,A) + Aa) [r+ ymax Q(S',a) - QS,A)]Policy-based RL
Episode: t, ay mu
Policy: we © To © n
What is the expected return for an episode?
J(g) = E, (R()) P(E) = (50) - odo | $0) PCS4 | da) Moy I Si) PCSr | Spar r-1)
RO =n tht
= [cri 0)- RG) ae
Learn policy m* that maximizes the expected return
n* = argmax (m9)
toPolicy-based RL
Policy update equation momat+aA-Vo (me)
How to calculate the policy gradient Vo J(79)?
Policy gradient theorem
T
Vo J(m9) = Ex > Va loge (at Ist) - R(t)
=Policy based RL : REINFORCE
Monte Carlo approximation
Vo Jae) = Ex y Vo logs (ae Is) - weo| ~ D1 Vo logmee(ar Ise) - RC)
REINFORCE algorithm:
» 1. sample {7‘} from 79(a;|s;) (run the policy)
2. Vod (8) © YY; (Oy Vo log mo(aj|s})) (2, (st, a4)
3.0 04+ aVoJ(0)Q-learning vs. REINFORCE
Uses Bellman optimality condition to learn optimalQ Uses Monte Carlo approximation to learn the policy
function. Policy is derived implicitly from the Q function. directly.
Computational inefficient for large action spaces and _Difficult to converge due to slow rate of learning.
continuous actions.
Off-policy learning, On-policy learning
High bias due to the TD approximation step. High variance due to MC samplingBreakthroughs using Deep RL (DeepMind)
Playing Atari with Deep Reinforcement Learning (2013)
* Play Atari 2600 video games by just observing
screen pixel
+ Receiving reward when games score increased!
* Same model architecture, 7 different games
Human-level control through Deep Reinforcement
Learning (2015)
+ Featured on cover of Nature!
* Same model architecture, 49 different games,
expert human-level performanceBreakthroughs using Deep RL (DeepMind)
chess Shos!
AlphaZero (2017)
* Single system that taught itself from scratch to
master Chess, shogi (Japanese chess) and Go.
AlphaStar (2019)
+ First Al to reach top league of StarCraft Il, a real-
time strategy video game.
+ Achieved a Grandmaster level for all three
StarCraft II races: Protoss, Terran, and Zerg.
* Neural networks, self-play via RL, multi-agent
learning, imitation learning*
reward | 1,4Why Deep Learning?
* Take last 4 screen images, resize to 84 X 84 pixels
* Convert to grayscale with 256 gray levels
* Possible game states?
25684x84x4 1067970 1
*
* Represent the value function by a parameterized function
approximator (Deep Neural Network), with parameter w € R”™
* G(s,a; w)
Value Function ApproximationDL in a nutshell!
* A general purpose framework for representation learning:
* Given an objective
+ Learn representation required to achieve objective
* Directly from raw inputs
* Using minimal domain knowledgeQ-Learning Revisited
* Q-Learning update
Q(S, A) = E, Ir +y max Q(s', a a
* Treat right-hand side r + y max Q(s',a';w) asa tatget
* Minimize MSE loss by stochastic gradient descent
l= (" +ymax Q(s',a'3w) — Q(s,a; w)
* Converges to q,.using table look-up representationDeep Q-Learning: Q-network
Represent value function by Q-network with weights w
Q(s,a;w) = Q*(s,a)
[Link]) —Q{s,a,,w)Deep Q-Learning: Experience Replay
* To remove correlations, build data-set from agent’s own experience
51, a1, 12,52
92, a2, 13,53
53,83, 14, 54
Se, de, Fess Sea1 > [Se Aes Petty Sepa
* Sample experiences from data-set and apply update
Ls (r+ ymgxQ(s',asw-) — Qs,a; wy
+ For faster convergence, target parameters v~ held fixed.Deep Q-Learning: Algorithm
Initialize replay memory D
Initialize action-value function Q with random weight
Initialize target action-value function Q with weights @
for episode = I to M do
Initialize sequence s; = {-r1} and preprocessed sequence dy = 6(s1)
for t= 110 T do
‘random action with probability €
argmax, Q(¢(s;),a;0) otherwise
Execute action a; in emulator and observe reward ry and image 2141
Set 8141 = 845 4,211 and preprocess Ge-1 = O(8¢41)
Store transition (1, 2,7, 011) in D
// experience replay
Sample random minibatch of transitions (¢;,3,73,@)41) from D
sety, {7 if episode terminates at step j +1
Ye Vrs + -ymaxe O(dj41,0%507) otherwise
Perform a gradient descent step on (y; — Q(¢j, aj; ))? [Link]. the network parameter @
7 per update of target network
Every C steps reset Q = Q, ice., set 0~ = 0
Following greed pty, select ox ~ {DQN in AtariConclusion
* Multi-armed bandits (and contextual bandits)
* Epsilon-greedy
+ UCB
+ Thompson sampling
* Reinforcement learning
+ Qlearning
+ Deep QL