Active Reinforcement Learning Overview

Uploaded by

Tùng Đào

Machine Learning

Reinforcement Learning (2)

Dr. Harry Goldingay

goldihj1@[Link]
Learning Outcomes

At the end of this lecture you should:

Understand the difference between active and passive reinforcement learning and understand the role of exploration in the former.

Understand two active reinforcement learning algorithms – SARSA and Q-learning – well enough to implement them.

Be aware of other important approaches to reinforcement learning.
Motivation
Aims of Reinforcement Learning

In the previous unit, we introduced passive reinforcement learning…
Given a policy, how do we evaluate it?

…but when talking about MDPs more generally, we did better: we could find optimal policies.

Can we do the same in a reinforcement learning setting?

Scenario 1

Our agent has a policy.

It acts in the environment following this policy.

It uses what it observes to estimate the utility of the states of the environment.

Based on this information, it calculates the optimal policy.

Any issues with this?

Scenario 2

Our agent has no policy – instead it acts randomly.

It acts in the environment.

It uses what it observes to estimate the utility of the states of the environment.

Based on this information, it calculates the optimal policy.

Any issues with this?

Which Slot Machine To Play?
How to Act in Grid World?
Active Reinforcement Learning
Greedy Algorithm

We were able to use the Bellman equations…

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

…to find the optimal policy $\pi^*$ from the utilities of states $U(s)$.

$$\pi^*(s) = \operatorname{argmax}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
Recap: Grid World: Utilities of States

For 𝛾 = 1
Recap: Policies and Utilities of States

If we know $U(s)$ for each state, we can use it to infer the optimal policy.
Choose the action which results in the highest expected utility of the next state.

What action should an agent take in state (3,1) in the previous example?
Up: 0.8 × 0.660 + 0.1 × 0.655 + 0.1 × 0.388 ≈ 0.63
Left: 0.8 × 0.655 + 0.1 × 0.660 + 0.1 × 0.611 ≈ 0.65
The agent should choose Left.

Key takeaway: $\pi^*$ can be inferred directly from the utilities (given the transition model).
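
The comparison for state (3,1) can be checked with a few lines of Python. This is only a sketch: the four utilities are the ones quoted in the worked example, while the 0.8 / 0.1 / 0.1 motion model and the label `stay` (the agent bumping into a wall and remaining in (3,1)) are assumptions about the grid world from the previous unit.

```python
# Utilities of the states reachable from (3,1), as quoted in the worked example.
# The key names are illustrative labels, not part of the lecture.
U = {"up": 0.660, "left": 0.655, "right": 0.388, "stay": 0.611}


def expected_utility(outcomes):
    """Sum of probability * utility over the possible next states."""
    return sum(p * u for p, u in outcomes)


# Assumed motion model: 0.8 for the intended direction, 0.1 for each
# perpendicular direction; moving into a wall leaves the agent in place.
q_up = expected_utility([(0.8, U["up"]), (0.1, U["left"]), (0.1, U["right"])])
q_left = expected_utility([(0.8, U["left"]), (0.1, U["up"]), (0.1, U["stay"])])

best = max([("Up", q_up), ("Left", q_left)], key=lambda pair: pair[1])[0]
print(round(q_up, 2), round(q_left, 2), best)  # 0.63 0.65 Left
```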

Greedy Agent

We were able to use the Bellman equations…

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

…to find the optimal policy $\pi^*$ from the utilities of states $U(s)$.

A greedy agent acts in the environment to estimate the required quantities…
$U(s)$
$R(s)$
$P(s' \mid s, a)$
…and then at each step chooses the action which maximises utility in its current state.

Choice of action makes this an example of active reinforcement learning.

Greedy Agent: Multi-armed Bandit
Greedy Agent: Grid World
Greedy Agent: Grid World Policy
Exploration

We have seen in the previous examples that the greedy algorithm is flawed.

Taking the current estimated optimal action restricts data gathering, which may mean that the optimal action is never discovered…

…and that, even when the action from the optimal policy is taken, it may not be optimal in the context of the rest of the agent’s policy.

Greedy agents exploit their knowledge – we need an agent which also explores.
𝜖-greedy Algorithm

We have already talked about an excellent way of exploring – acting randomly in all states!

Can we get the advantages of both the greedy and the random approach?

Simple solution: the $\epsilon$-greedy algorithm:

Act randomly with probability $\epsilon$ (a parameter of the algorithm)
Otherwise act greedily

This algorithm makes the trade-off between exploration and exploitation clear.
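
The two rules above translate almost directly into code. The following is a minimal sketch, not the lecture's implementation: the function name `epsilon_greedy`, the dictionary of per-action value estimates and the three bandit arms are all hypothetical.

```python
import random


def epsilon_greedy(estimates, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy one.

    `estimates` maps each available action to its current estimated value.
    """
    actions = list(estimates)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore
    return max(actions, key=estimates.get)   # exploit


# Hypothetical three-armed bandit with made-up value estimates.
arm_estimates = {"arm_a": 0.2, "arm_b": 0.5, "arm_c": 0.1}
rng = random.Random(0)
picks = [epsilon_greedy(arm_estimates, 0.1, rng) for _ in range(1000)]
print(picks.count("arm_b") / len(picks))  # fraction of pulls on the greedy arm
```

With $\epsilon = 0$ this reduces to the greedy agent, and with $\epsilon = 1$ to the purely random one.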
Balancing Exploration and Exploitation

Recall that, in the context of the multi-armed bandit problem, we said that:
As we gain information, we don’t want to keep taking seemingly bad options frequently…
…but a finite number of trials is never enough to be certain about the result of a stochastic process – we are never done with exploration.

In theory we want our algorithms to be greedy in the limit with infinite exploration (GLIE).
Infinite exploration: they should take all actions in all states an unbounded number of times.
Greedy in the limit: they should approach greedy behaviour arbitrarily closely as time increases.
Simple example, $\epsilon_t$-greedy: at time step $t$ the agent acts randomly with probability $\epsilon_t = 1/t$.
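
A small numerical sketch of the $1/t$ schedule shows both halves of the GLIE idea. The function name `epsilon_t` and the horizon of 10,000 steps are arbitrary choices for illustration.

```python
def epsilon_t(t):
    """Exploration probability at time step t (t starts at 1)."""
    return 1.0 / t


# Greedy in the limit: the probability of a random action shrinks towards 0.
schedule = [epsilon_t(t) for t in (1, 10, 100, 1000)]

# Infinite exploration: the expected number of random actions is the harmonic
# series, which keeps growing (slowly) as the horizon increases.
expected_random_steps = sum(epsilon_t(t) for t in range(1, 10_001))

print(schedule, expected_random_steps)
```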
Exploration Function

Our two previous algorithms explore, but don’t prioritise their exploration.
We can encourage exploration of unvisited states by optimistically estimating their utility.

Consider an active version of the previously discussed value iteration algorithm in which an agent acts greedily at each step based on an optimistic estimated utility $U^+(s)$, which is updated as follows:

$$U^+(s) = R(s) + \gamma \max_{a} f\Big(\sum_{s'} P(s' \mid s, a)\, U^+(s'),\; N(s, a)\Big)$$

Where:

$$f(u, n) = \begin{cases} R^+, & \text{if } n < N_e \\ u, & \text{otherwise} \end{cases}$$

And where $N(s, a)$ is the count of times the agent has taken action $a$ in state $s$, $R^+$ is an optimistic estimate of the best achievable utility, and $N_e$ is a parameter of the algorithm.
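
The update above can be sketched as follows. The function names, the dictionary-based representation of $R$, $P$, $U^+$ and $N$, and the one-state example are all assumptions made for illustration.

```python
def exploration_f(u, n, r_plus, n_e):
    """Return the optimistic value r_plus until tried n_e times, then u."""
    return r_plus if n < n_e else u


def optimistic_backup(s, actions, R, P, U_plus, N, gamma, r_plus, n_e):
    """One application of the U+ update for state s.

    P[(s, a)] maps each next state to its (estimated) probability and
    N[(s, a)] counts how often action a has been taken in state s.
    """
    def expected(a):
        return sum(p * U_plus[s2] for s2, p in P[(s, a)].items())

    return R[s] + gamma * max(
        exploration_f(expected(a), N.get((s, a), 0), r_plus, n_e)
        for a in actions
    )


# Hypothetical one-state example: action "a" is well explored, "b" is untried.
R = {"s": -0.04}
P = {("s", "a"): {"s": 1.0}, ("s", "b"): {"s": 1.0}}
U_plus = {"s": 0.5}
N = {("s", "a"): 5}
value = optimistic_backup("s", ["a", "b"], R, P, U_plus, N,
                          gamma=1.0, r_plus=2.0, n_e=3)
print(value)  # the untried action "b" contributes the optimistic value 2.0
```

Because the untried action is valued at $R^+$, a greedy agent using $U^+$ is drawn towards it until it has been tried $N_e$ times.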
Grid World: Exploration Function
SARSA and Q-learning
TDL For Active Reinforcement Learning

Recall the update equation for TDL:

$$U(s) \leftarrow U(s) + \alpha(N_s(s)) \big[ R(s) + \gamma U(s') - U(s) \big]$$

And our criterion for choosing an action based on the Bellman equations:

$$\pi^*(s) = \operatorname{argmax}_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

What issues does this imply in an active reinforcement learning setting?
The passive version of TDL does not give us enough information to choose an action!
We would also need to know or learn the transition model $P(s' \mid s, a)$.
Q-values

Can we adapt the ideas of TDL to give us a model-free approach to active reinforcement learning?

One approach is not to learn the utilities of states $U(s)$, but instead to learn Q-values $Q(s, a)$ – the expected utility of taking action $a$ in state $s$.

Note that, given Q-values, it is easy to recover state utilities:

$$U(s) = \max_{a} Q(s, a)$$

and to greedily choose actions:

$$\pi^*(s) = \operatorname{argmax}_{a} Q(s, a)$$
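
As a brief sketch, both quantities are one-line lookups once the Q-values are stored in a table. The dictionary layout and function names are assumptions; the numbers are borrowed from the illustrated example later in this lecture.

```python
# Q-table keyed by (state, action).
Q = {
    ("s0", "a0"): -0.8, ("s0", "a1"): -0.7,
    ("s1", "a0"): -1.35, ("s1", "a1"): -0.85,
}
ACTIONS = ["a0", "a1"]


def utility(Q, s, actions):
    """U(s) = max over a of Q(s, a)."""
    return max(Q[(s, a)] for a in actions)


def greedy_action(Q, s, actions):
    """pi*(s) = argmax over a of Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])


for s in ("s0", "s1"):
    print(s, utility(Q, s, ACTIONS), greedy_action(Q, s, ACTIONS))
```

Note that no transition model appears anywhere: this is what makes the approach model-free.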
SARSA

How to learn Q-values? How to act on them?

In SARSA (State-Action-Reward-State-Action) an agent has some policy $\pi$ based on $Q$ (e.g. $\epsilon$-greedy).

In state $s$ it takes action $a$, resulting in reward $R(s)$ and new state $s'$.

In state $s'$ it takes action $a'$.

It updates Q-values as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R(s) + \gamma Q(s', a') - Q(s, a) \big]$$

The updated Q-values may cause the policy to change.
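
A single SARSA step might look like the sketch below. The function name and the dictionary-based Q-table are assumptions; the numbers in the usage lines are taken from the illustrated example later in this lecture ($\alpha = 0.1$, $\gamma = 0.75$).

```python
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """Apply one SARSA update in place and return the new Q(s, a).

    The target uses a_next, the action the agent's policy actually chose
    in s_next, which is what makes the method on-policy.
    """
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = {("s0", "a1"): -0.7, ("s1", "a0"): -1.35, ("s1", "a1"): -0.85}
new_value = sarsa_update(Q, "s0", "a1", -0.5, "s1", "a0", alpha=0.1, gamma=0.75)
print(round(new_value, 2))  # -0.78
```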

Q-Learning

Q-learning is a related approach.

An agent has some policy 𝜋 based on 𝑄 (e.g. 𝜖-greedy).

In state 𝑠 it takes action 𝑎 resulting in reward 𝑅(𝑠) and new state 𝑠′.

It updates Q-values as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$

The updated Q-values may cause the policy to change.
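
A single Q-learning step might look like the sketch below. The function name and the dictionary-based Q-table are assumptions; the numbers in the usage lines are taken from the illustrated example later in this lecture ($\alpha = 0.1$, $\gamma = 0.75$).

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(s, a).

    The target uses the best action available in s_next, whatever the
    agent goes on to do there, which is what makes the method off-policy.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


Q = {
    ("s0", "a0"): -0.8, ("s0", "a1"): -0.7,
    ("s1", "a0"): -1.35, ("s1", "a1"): -0.85,
}
new_value = q_learning_update(Q, "s0", "a1", -0.5, "s1", ["a0", "a1"],
                              alpha=0.1, gamma=0.75)
print(round(new_value, 2))  # -0.74
```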

TDL vs SARSA vs Q-learning

Compare the update equations.

Passive Temporal Difference Learning:

$$U(s) \leftarrow U(s) + \alpha(N_s(s)) \big[ R(s) + \gamma U(s') - U(s) \big]$$

SARSA (expected utility based on the action from the policy):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R(s) + \gamma Q(s', a') - Q(s, a) \big]$$

Q-learning (expected utility based on the greedy policy):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$
SARSA and Q-Learning Compared

If we knew the true Q-values, then behaving greedily would be optimal…
Q-learning tries to learn this optimal policy regardless of the actual policy the agent is using.
It is an off-policy learning algorithm.

…however, if an agent is not following a greedy policy (e.g. because it is exploring), then basing its actions on Q-values that assume greedy behaviour can cause the agent to act suboptimally.
SARSA instead tries to learn the Q-values of the policy it is currently following.
It is an on-policy learning algorithm.

…and can cause issues with convergence.

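SARSA and Q-learning Illustrated

[Figure: transition diagram for a three-state MDP with states $s_0$, $s_1$, $s_2$ and actions $a_0$, $a_1$; edges are labelled with transition probabilities 1.0, 1.0, 0.5, 0.1, 0.5 and 0.9.]

$R(s_0) = -0.50$
$R(s_1) = -0.75$
$R(s_2) = -0.10$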
SARSA and Q-learning Illustrated

Imagine that we have run one of SARSA or Q-learning with an $\epsilon$-greedy policy and obtained the following estimates of Q-values:

$Q(s, a)$    $s_0$    $s_1$
$a_0$        -0.8     -1.35
$a_1$        -0.7     -0.85

What is the greedy policy?

Take action 𝑎1 in state 𝑠0 and action 𝑎1 in state 𝑠1 .

Let us assume that the agent starts in state 𝑠0 and chooses to take action 𝑎1 ,
receiving reward -0.5 and transitioning to state 𝑠1 .

For the purposes of the SARSA update, we will assume that, in this state, it
chooses action 𝑎0 (exploration).
SARSA and Q-learning Illustrated

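Sequence: $\{s_0, a_1\} \xrightarrow{-0.5} \{s_1, a_0\}$

The agent has taken action $a_1$ in state $s_0$, so it must update $Q(s_0, a_1)$.
We will set $\alpha = 0.1$ and $\gamma = 0.75$.

SARSA (uses the action chosen in state $s_1$, here $a_0$):

$$Q(s_0, a_1) \leftarrow Q(s_0, a_1) + \alpha \big[ R(s_0) + \gamma Q(s_1, a_0) - Q(s_0, a_1) \big]$$

$$Q(s_0, a_1) \leftarrow -0.7 + 0.1 \times \big[ -0.5 + 0.75 \times (-1.35) - (-0.7) \big] \approx -0.78$$

Q-learning (uses the best action in state $s_1$, here $a_1$):

$$Q(s_0, a_1) \leftarrow Q(s_0, a_1) + \alpha \big[ R(s_0) + \gamma \max_{a'} Q(s_1, a') - Q(s_0, a_1) \big]$$

$$Q(s_0, a_1) \leftarrow -0.7 + 0.1 \times \big[ -0.5 + 0.75 \times (-0.85) - (-0.7) \big] \approx -0.74$$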
Generalisation
Scaling Up

The approaches we have discussed so far depend on estimating quantities for every state…
$Q(s, a)$

…or even a number of quantities that is quadratic in the number of states.

$P(s' \mid s, a)$

We need (multiple) samples for each state.

Chess has been estimated to have $10^{40}$ board states.

We need some way to generalise from data we have gathered about states we have visited.
Generalisation – The Problem

Assume that we want to generalise Q-values, then:

We have a series of state-action pairs

$$[(\sigma_1, \alpha_1), (\sigma_2, \alpha_2), \ldots, (\sigma_N, \alpha_N)]$$

For each state-action pair we have a sample of the corresponding Q-value

$$[Q(\sigma_1, \alpha_1), Q(\sigma_2, \alpha_2), \ldots, Q(\sigma_N, \alpha_N)]$$

We want to predict the Q-value function for unseen state-action pairs:

$$Q(\sigma, \alpha)$$
Function Approximation

The problem we have just described is just a supervised learning problem.

We choose some model to approximate our Q-value function

$$\hat{Q}_{\boldsymbol{\theta}}(s, a)$$

We want to update our parameters $\boldsymbol{\theta}$ to minimise some error

We will use sum-of-squares

We can use on-line learning to update the model parameters after each trial:
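
One common way to do this, assumed here rather than taken from the lecture, is the semi-gradient Q-learning update for a linear model $\hat{Q}_{\boldsymbol{\theta}}(s, a) = \boldsymbol{\theta} \cdot \phi(s, a)$. It takes one gradient step on half the squared error between the prediction and the target $R(s) + \gamma \max_{a'} \hat{Q}_{\boldsymbol{\theta}}(s', a')$, treating the target as a constant. The feature function `phi`, the function names and the numbers below are all hypothetical.

```python
def q_hat(theta, features):
    """Linear model: dot product of the parameters and the feature vector."""
    return sum(t * f for t, f in zip(theta, features))


def online_update(theta, phi, s, a, reward, s_next, actions, alpha, gamma):
    """Return the parameters after one on-line update for one transition."""
    features = phi(s, a)
    prediction = q_hat(theta, features)
    target = reward + gamma * max(q_hat(theta, phi(s_next, a2)) for a2 in actions)
    error = target - prediction
    # For a linear model the gradient of q_hat w.r.t. theta is the feature vector.
    return [t + alpha * error * f for t, f in zip(theta, features)]


def phi(s, a):
    """Hypothetical features: a bias term, the numeric state and the action."""
    return [1.0, float(s), float(a)]


theta = [0.0, 0.0, 0.0]
theta = online_update(theta, phi, s=1, a=0, reward=1.0, s_next=2,
                      actions=[0, 1], alpha=0.1, gamma=0.9)
print(theta)
```

Because the parameters are shared across all state-action pairs, each update also changes the predictions for pairs the agent has never visited, which is the generalisation being sought.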
Other Approaches
Dynamic Programming

We have focussed primarily on model-free approaches to reinforcement learning (temporal difference approaches)…

…but model-based approaches, such as adaptive dynamic programming, are also successful:
Learn the model by acting in the environment.
Use offline approaches (e.g. value iteration) to infer the optimal policy.

These approaches often converge faster than model-free approaches…

…but don’t scale as well.

Policy Search

We have introduced ways of constructing policies from parameters (e.g. Q-values).

We can estimate the quality of a policy by running trials of an agent with that policy in the environment and measuring observed reward.

We want to maximise the quality of the policy.

This is an optimization problem! Can apply tools such as gradient descent.
Conclusion

You should now:

Understand the difference between active and passive reinforcement learning and understand the role of exploration in the former.

Understand two active reinforcement learning algorithms – SARSA and Q-learning – well enough to implement them.

Be aware of other important approaches to reinforcement learning.
