Machine Learning
Reinforcement Learning
(2)
Dr. Harry Goldingay
goldihj1@[Link]
Learning Outcomes
At the end of this lecture you should:
Understand the difference between active and passive
reinforcement learning and understand the role of
exploration in the former.
Understand two active reinforcement learning algorithms –
SARSA and Q-learning – well enough to implement them.
Be aware of other important approaches to reinforcement
learning.
Motivation
Aims of Reinforcement Learning
In the previous unit, we introduced passive reinforcement
learning…
Given a policy, how do we evaluate it?
…but when talking about MDPs more generally, we did better:
we could find optimal policies.
Can we do the same in a reinforcement learning setting?
Scenario 1
Our agent has a policy.
It acts in the environment following this policy.
It uses the information it observes to learn information about the
utility of the environmental states.
Based on this information, it calculates the optimal policy.
Any issues with this?
Scenario 2
Our agent has no policy – instead it acts randomly.
It acts in the environment.
It uses the information it observes to learn information about the
utility of the environmental states.
Based on this information, it calculates the optimal policy.
Any issues with this?
Which Slot Machine To Play?
How to Act in Grid World?
Active
Reinforcement
Learning
Greedy Algorithm
We were able to use the Bellman equations…
𝑈 𝑠 = 𝑅 𝑠 + 𝛾 max 𝑃 𝑠 ′ 𝑠, 𝑎 𝑈(𝑠 ′ )
𝑎∈𝐴(𝑠)
𝑠′
…to find the optimal policy 𝜋 ∗ from the utilities of states 𝑈 𝑠 .
𝜋 ∗ 𝑠 = argmax 𝑃 𝑠 ′ 𝑠, 𝑎 𝑈(𝑠 ′ )
𝑎∈𝐴(𝑠)
𝑠′
Recap: Grid World: Utilities of States
For 𝛾 = 1
Recap: Policies and Utilities of States
If we know 𝑈 𝑠 for each state, we can use it to infer the optimal
policy.
Choose the action which results in the highest expected
utility of the next state.
What action should an agent take in state (3,1) in the previous
example?
Up: 0.8 × 0.66 + 0.1 × 0.655 + 0.388 ≈ 0.63
Left: 0.8 × 0.655 + 0.1 × 0.66 + 0.611 ≈ 0.65
The agent should choose Left.
Key takeaway: 𝜋 ∗ can be inferred directly from utility.
Greedy Agent
We were able to use the Bellman equations…
𝑈 𝑠 = 𝑅 𝑠 + 𝛾 max 𝑃 𝑠 ′ 𝑠, 𝑎 𝑈(𝑠 ′ )
𝑎∈𝐴(𝑠)
𝑠′
…to find the optimal policy 𝜋 ∗ from the utilities of states 𝑈 𝑠 .
A greedy agent acts in the environment to estimate the required
quantities…
𝑈 𝑠
𝑅 𝑠
𝑃 𝑠 ′ 𝑠, 𝑎
… and then at each step chooses the action which maximises utility in
its current state.
Choice of action makes this an example of active reinforcement learning.
Greedy Agent: Multi-armed Bandit
Greedy Agent: Grid World
Greedy Agent: Grid World Policy
Exploration
We have seen in the previous examples that the greedy
algorithm is flawed.
Taking the current estimated optimal action restricts data
gathering – may mean that the optimal action is not
discovered…
…and that, even when the action from the optimal policy is
taken, it may not be optimal in the context of the rest of the
agent’s policy.
Greedy agents exploit their knowledge – we need an agent
which also explores.
𝜖-greedy Algorithm
We have already talked about an excellent way of exploring –
acting randomly in all states!
Can we get the advantages of the greedy and the random
approach?
Simple solution: 𝝐-greedy algorithm:
Act randomly with probability 𝜖 (some parameter of the
algorithm)
Otherwise act greedily
This algorithm makes the trade-off between exploration and
exploitation clear.
Balancing Exploration and Exploitation
Recall that, in the context of the multi-armed bandit problem, we said that:
As we get information, we don’t want to keep seemingly taking bad
options frequently…
…but a finite number of trials is never enough to be certain about the
result of a stochastic process – we are never done with exploration.
In theory we want our algorithms to be greedy in the limit with infinite
exploration (GLIE).
Infinite exploration: they should take all actions in all states an
unbounded number of times.
Greedy in the limit: approaches greedy behaviour arbitrarily closely as
time increases.
Simple example 𝝐𝒕 -greedy: agent acts randomly a fraction 1/𝑡.
Exploration Function
Our two previous algorithms explore, but don’t prioritise their exploration.
We can encourage exploration of unvisited states by optimistically
estimating their utility.
Consider an active version of the previously discussed value iteration algorithm
in which an agent acts greedily at each step based on optimistic estimated utility
𝑈 + 𝑠 which is updated as follows:
𝑈 + 𝑠 = 𝑅 𝑠 + 𝛾 max 𝑓( 𝑃 𝑠 ′ 𝑠, 𝑎 𝑈 + 𝑠 ′ , 𝑁(𝑠, 𝑎))
𝑎
𝑠′
Where:
𝑅+, 𝑖𝑓 𝑛 < 𝑁𝑒
𝑓 𝑢, 𝑛 = ቊ
𝑢, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
And where 𝑁(𝑠, 𝑎) is the count of times the agent has taken action 𝑎 in state 𝑠
and 𝑁𝑒 is a parameter of the algorithm.
Grid World: Exploration Function
SARSA and
Q-learning
TDL For Active Reinforcement Learning
Recall the update equation for TDL:
𝑈 𝑠 = 𝑈 𝑠 + 𝛼 𝑁𝑠 𝑠 𝑅 𝑠 + 𝛾𝑈 𝑠 ′ − 𝑈 𝑠
And our criterion for choosing an action based on the bellman
equations:
𝜋 ∗ 𝑠 = argmax 𝑃 𝑠 ′ 𝑠, 𝑎 𝑈(𝑠 ′ )
𝑎∈𝐴(𝑠)
𝑠′
What issues does this imply in an active reinforcement learning
setting?
The passive version of TDL does not give us enough information to
choose an action!
We would need also to know/learn the transition model 𝑃 𝑠 ′ 𝑠, 𝑎
Q-values
Can we adapt the ideas of TDL to give us a model-free approach to
active reinforcement learning?
One approach is not to learn the utilities of states 𝑈(𝑠), but instead to
learn Q-values 𝑄(𝑠, 𝑎) – the expected utility of taking action 𝑎 in state 𝑠.
Note that, given Q-values it is easy to recover state utilities:
𝑈 𝑠 = max 𝑄(𝑎, 𝑠)
𝑎
and to greedily choose actions:
𝜋 ∗ 𝑠 = argmax 𝑄(𝑎, 𝑠)
𝑎
SARSA
How to learn Q-values? How to act on them?
In SARSA (State-Action-Reward-State-Action) an agent has some
policy 𝜋 based on 𝑄 (e.g. 𝜖-greedy).
In state 𝑠 it takes action 𝑎 resulting in reward 𝑅(𝑠) and new state 𝑠′.
In state 𝑠′ it takes action 𝑎′.
It updates Q-values as follows:
𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 + 𝛼 𝑅 𝑠 + 𝛾𝑄 𝑠 ′ , 𝑎′ − 𝑄 𝑠, 𝑎
The updated Q-values may cause the policy to change.
Q-Learning
Q-learning is a related approach.
An agent has some policy 𝜋 based on 𝑄 (e.g. 𝜖-greedy).
In state 𝑠 it takes action 𝑎 resulting in reward 𝑅(𝑠) and new state 𝑠′.
It updates Q-values as follows:
𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 + 𝛼 𝑅 𝑠 + 𝛾 max
′
𝑄(𝑠 ′ , 𝑎′ ) − 𝑄 𝑠, 𝑎
𝑎
The updated Q-values may cause the policy to change.
TDL vs SARSA vs Q-learning
Compare the update equations.
Passive Temporal Difference Learning:
𝑈 𝑠 = 𝑈 𝑠 + 𝛼 𝑁𝑠 𝑠 𝑅 𝑠 + 𝛾𝑈 𝑠 ′ − 𝑈 𝑠
Expected utility
based on action
SARSA: from policy
𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 + 𝛼 𝑅 𝑠 + 𝛾𝑄 𝑠 ′ , 𝑎′ − 𝑄 𝑠, 𝑎
Q-learning:
𝑄 𝑠, 𝑎 = 𝑄 𝑠, 𝑎 + 𝛼 𝑅 𝑠 + 𝛾 max 𝑄(𝑠 ′ , 𝑎 ′ ) − 𝑄 𝑠, 𝑎
′ 𝑎
Expected utility based
on greedy policy
SARSA and Q-Learning Compared
−106
?
? ? ?
SARSA and Q-Learning Compared
If we knew the true Q-values, then behaving greedily would be
optimal…
Q-learning tries to learn this optimal policy regardless of the actual
policy the agent is using.
It is an off-policy learning algorithm.
…however, if an agent is not following a greedy policy (e.g. because it
is exploring) then basing actions on Q-values for a greedy algorithm
can cause the agent to act suboptimally.
SARSA instead tries to learn the optimal Q-values for its current
policy.
It is an on-policy learning algorithm.
…and can cause issues with convergence.
SARSA and Q-learning Illustrated
𝑎1 𝟏. 𝟎
𝑠0
𝑠1
𝟏. 𝟎
𝑎0
𝟎. 𝟓
𝟎. 𝟏
𝑎0 𝑎1
𝟎. 𝟓
𝟎. 𝟗
𝑅 𝑠0 = −0.50
𝑠2
𝑅 𝑠1 = −0.75
𝑅 𝑠2 = −0.10
SARSA and Q-learning Illustrated
Imagine that we have run one of SARSA or Q-learning with an 𝜖-greedy policy
and obtained the following estimates of Q-values
𝑄(𝒂, 𝒔) 𝑠0 𝑠𝟏
𝑎0 -0.8 -1.35
𝑎1 -0.7 -0.85
What is the greedy policy?
Take action 𝑎1 in state 𝑠0 and action 𝑎1 in state 𝑠1 .
Let us assume that the agent starts in state 𝑠0 and chooses to take action 𝑎1 ,
receiving reward -0.5 and transitioning to state 𝑠1 .
For the purposes of the SARSA update, we will assume that, in this state, it
chooses action 𝑎0 (exploration).
SARSA and Q-learning Illustrated
Sequence: 𝑠0 , 𝑎1 −0.5 → {𝑠1 , 𝑎0 }
The agent has taken action 𝑎1 in state 𝑠0 so must update
𝑄(𝑠0 , 𝑎1 ).
We will set 𝛼 = 0.1 and 𝛾 = 0.75
Action chosen
in state 𝑠1
SARSA:
𝑄 𝑠0 , 𝑎1 = 𝑄 𝑠0 , 𝑎1 + 𝛼 𝑅 𝑠0 + 𝛾𝑄 𝑠1 , 𝑎0 − 𝑄 𝑠0 , 𝑎1
𝑄 𝑠0 , 𝑎1 = −0.7 + 0.1 −0.5 − 0.75 × 1.35 + 0.7 = −0.78
Best action in state 𝑠1
Q-learning:
𝑄 𝑠0 , 𝑎1 = 𝑄 𝑠0 , 𝑎1 + 𝛼 𝑅 𝑠0 + 𝛾 max 𝑄(𝑠 , 𝑎 ′) − 𝑄 𝑠 , 𝑎
′ 1 0 1
𝑎
𝑄 𝑠, 𝑎 = −0.7 + 0.1 −0.5 − 0.75 × 0.85 + 0.7 = −0.74
Generalisation
Scaling Up
Scaling Up
The approaches we have discussed so far depend on
estimating quantities per state…
𝑄(𝑠, 𝑎)
…or even quadratic in the numbers of states.
𝑃(𝑠’|𝑠, 𝑎)
We need (multiple) samples for each state.
Chess has been estimated to have 1040 board states.
We need some way to generalise from data we have gathered
about states we have visited.
Genralisation – The Problem
Assume that we want to generalise Q-values, then:
We have a series of state-action pairs
[ 𝜎1 , 𝛼1 , 𝜎2 , 𝛼2 , … , 𝜎𝑁 , 𝛼𝑁 ]
For each state-action pair we have a sample of the
corresponding Q-value
[𝑄 𝜎1 , 𝛼1 , 𝑄 𝜎2 , 𝛼2 , … , 𝑄 𝜎𝑁 , 𝛼𝑁 ]
We want to predict the Q-value function for unseen state-
action pairs:
𝑄(𝜎, 𝛼)
Function Approximation
The problem we have just described is just a supervised
learning problem.
We choose some model to approximate our Q-value function
𝑄𝜽 (𝑠, 𝑎)
We want to update our parameters 𝜽 to minimise some error
We will use sum-of-squares
We can use on-line learning to update the model parameters
after each trial:
Other Approaches
Dynamic Programming
We have focussed primarily on model-free approaches to
reinforcement learning (temporal difference approaches)…
…but model-based approaches, such as adaptive dynamic
programming are also successful:
Learn the model by acting in the environment
Use offline approaches (e.g. value iteration) to infer optimal policy.
These approaches often converge faster than model-free
approaches…
…but don’t scale as well.
Policy Search
We have introduced ways of constructing policies from
parameters (e.g. Q-values).
We can estimate the quality of a policy by running trials of an
agent with that policy in the environment and measuring
observed reward.
We want to maximise the quality of the policy.
This is an optimization problem! Can apply tools such as
gradient descent.
Conclusion
You should now:
Understand the difference between active and passive
reinforcement learning and understand the role of
exploration in the former.
Understand two active reinforcement learning algorithms –
SARSA and Q-learning – well enough to implement them.
Be aware of other important approaches to reinforcement
learning.