2025 - 07 - Approximate Policy Iteration

The document discusses Approximate Policy Iteration in reinforcement learning, focusing on the challenges of large or continuous state spaces where exact policy evaluation is infeasible. It outlines the procedure for policy iteration, including policy evaluation and improvement, and emphasizes the need for function approximation to generalize across states. Additionally, it touches on data generation methods for sampling state-action pairs to support the learning process.

Uploaded by

Ali

Approximate Policy Iteration

Reinforcement Learning

Roberto Capobianco
Recap
Finite-Horizon MDPs

Slightly different formulation:

$(S, A, R, T, H, \mu_0)$

with time horizon $H \geq 0$ and initial state distribution $\mu_0$, i.e. $s_0 \sim \mu_0$

We consider time-dependent policies 𝜋

$\pi = \{\pi_0, \pi_1, \pi_2, \ldots, \pi_{H-1}\}$
Actions might be different for the same state depending on $t$
Very common in control!

Finite-Horizon MDP: V & Q

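$V_h^\pi(s) = \mathbb{E}_\pi\left[\sum_{\tau=h}^{H-1} r(s_\tau, a_\tau)\right]$

where $s_h = s$, $a_\tau = \pi_\tau(s_\tau)$ and $s_{\tau+1} \sim P(\cdot \mid s_\tau, a_\tau)$

$Q_h^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{\tau=h}^{H-1} r(s_\tau, a_\tau)\right]$

where $s_h = s$, $a_h = a$, $a_\tau = \pi_\tau(s_\tau)$ for $\tau > h$ and $s_{\tau+1} \sim P(\cdot \mid s_\tau, a_\tau)$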
Finite-Horizon MDP: Bellman Equation

$Q_h^\pi(s,a) = r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[V_{h+1}^\pi(s')\right]$
Finding the Optimal Policy

$\pi^* = \{\pi_0^*, \pi_1^*, \pi_2^*, \ldots, \pi_{H-1}^*\}$

Let’s reason backwards in time and apply dynamic programming:

$Q_{H-1}^*(s,a) = r(s,a)$

$\pi_{H-1}^*(s) = \arg\max_a Q_{H-1}^*(s,a)$

$V_{H-1}^*(s) = \max_a Q_{H-1}^*(s,a) = Q_{H-1}^*(s, \pi_{H-1}^*(s))$
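The backward reasoning above can be sketched in a few lines for a small tabular problem. This is only an illustration under stated assumptions: the arrays `P` and `r`, the function name `finite_horizon_dp`, and the 2-state toy MDP are made up here and do not come from the slides. Steps before the last one reuse the Bellman equation with the optimal value of the next step.

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Backward induction for a tabular finite-horizon MDP.

    P: array (S, A, S), P[s, a, s2] = probability of reaching s2 from (s, a)
    r: array (S, A), immediate reward
    Returns Q (H, S, A), V (H, S) and the greedy policy pi (H, S).
    """
    S, A = r.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H, S))
    pi = np.zeros((H, S), dtype=int)
    V_next = np.zeros(S)              # nothing is collected after the last step
    for h in range(H - 1, -1, -1):
        Q[h] = r + P @ V_next         # Q_h(s,a) = r(s,a) + E[V_{h+1}(s')]
        pi[h] = Q[h].argmax(axis=1)   # pi_h(s) = argmax_a Q_h(s,a)
        V[h] = Q[h].max(axis=1)       # V_h(s) = max_a Q_h(s,a)
        V_next = V[h]
    return Q, V, pi

# Hypothetical toy MDP: action 0 always leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [1.0, 2.0]])
Q, V, pi = finite_horizon_dp(P, r, H=3)
```

Note that the returned policy is indexed by time step, matching the time-dependent policies of the finite-horizon formulation.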
Control Problems

So far, we assumed discrete state and action spaces, but what about cartpole?

● state: angular position & velocity, linear position & velocity
● action/control: force applied on the cart
● goal: find the control policy which minimizes the long-term cost $c$
Optimal Control

Given a dynamical system with a non-linear transition function $f$, state $x \in \mathbb{R}^d$ and control $u \in \mathbb{R}^k$, we want to find a control policy $\pi$ that solves

minimize $\mathbb{E}_\pi\left[c_H(x_H) + \sum_{h=0}^{H-1} c_h(x_h, u_h)\right]$

where $u_h = \pi(x_h)$ and $x_0 \sim \mu_0$

Now this seems very familiar! Can we treat it as a Finite-Horizon MDP?
Bellman’s Curse of Dimensionality

● n-dimensional (discrete) state space
● The number of states grows exponentially in n

In practice discretization is useful, but it is only computationally feasible up to 5- or 6-dimensional state spaces

Let’s try to work directly in continuous space, starting from simplified problems
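To make the exponential growth concrete, here is a tiny arithmetic sketch. The choice of 10 bins per dimension and the helper name `num_grid_states` are assumptions made only for this example.

```python
def num_grid_states(bins_per_dim: int, n_dims: int) -> int:
    """Number of grid cells when each of n_dims dimensions is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

# Grid sizes for a few dimensionalities, with a hypothetical 10 bins per dimension.
counts = {n: num_grid_states(10, n) for n in (1, 2, 4, 6, 8)}
```

With these numbers, the 4-dimensional cartpole state already needs 10,000 cells, and 6 dimensions need a million, before even multiplying by the number of actions.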
Linear Systems

Consider a system of this kind:

$x_{t+1} = A x_t + B u_t$ (this is our transition function!)

● $x_t$: state at time $t$
● $u_t$: control (i.e., action) at time $t$

$A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times k}$
Quadratic Cost Function

Consider a cost function of this kind

$c(x_t, u_t) = x_t^\top Q x_t + u_t^\top R u_t$ (alternative notation: $g(x_t, u_t)$)

● $Q \in \mathbb{R}^{d \times d}$ and $R \in \mathbb{R}^{k \times k}$ square matrices
● $Q$ and $R$ positive definite

As a result, there is a non-zero cost for any non-zero state with all-zero control
LQR algorithm

● Initialize $P_H$ (at 0 or $Q$)
● Starting from $h = H-1$, backwards:
○ Set $K_h^* = -(R + B^\top P_{h+1} B)^{-1} B^\top P_{h+1} A$
○ Compute $u_h = \pi_h^*(x_h) = K_h^* x_h$
○ Set $P_h = Q + K_h^{*\top} R K_h^* + (A + B K_h^*)^\top P_{h+1} (A + B K_h^*)$ (Riccati equation)
○ Set $J_h^*(x_h) = x_h^\top P_h x_h$

This is the Value Iteration update done in closed form: it is always the same and solves this particular continuous-state system with a quadratic cost
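The recursion above translates almost line by line into NumPy. The sketch below is a minimal illustration, not a reference implementation: the function name `lqr_backward`, the double-integrator matrices and the horizon of 50 steps are assumptions chosen for the example. It also rolls the closed loop forward to check that the accumulated cost agrees with $x_0^\top P_0 x_0$.

```python
import numpy as np

def lqr_backward(A, B, Q, R, H, P_final=None):
    """Finite-horizon LQR via the backward Riccati recursion.

    Returns the gains K[0..H-1] and the cost-to-go matrices P[0..H].
    """
    d = A.shape[0]
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = np.zeros((d, d)) if P_final is None else P_final
    for h in range(H - 1, -1, -1):
        # K_h = -(R + B^T P_{h+1} B)^{-1} B^T P_{h+1} A
        K[h] = -np.linalg.solve(R + B.T @ P[h + 1] @ B, B.T @ P[h + 1] @ A)
        A_cl = A + B @ K[h]  # closed-loop dynamics under u = K_h x
        P[h] = Q + K[h].T @ R @ K[h] + A_cl.T @ P[h + 1] @ A_cl
    return K, P

# Hypothetical double integrator (position, velocity) with time step 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])
H = 50
K, P = lqr_backward(A, B, Q, R, H, P_final=Q)

# Roll the closed loop forward and accumulate the quadratic cost.
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for h in range(H):
    u = K[h] @ x
    cost += x @ Q @ x + u @ R @ u
    x = A @ x + B @ u
cost += x @ P[H] @ x  # terminal cost x_H^T P_H x_H
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a numerical-stability choice made here, not something stated in the slides.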
LQR Extensions

Extensions to the LQR make it more generally applicable to:

● Affine systems
● Systems with stochasticity
● Regulation around non-zero fixed point for non-linear systems
● Trajectory following for non-linear systems
● ...
End Recap
Policy Iteration

● Outputs policies at every iteration: $\{\pi_0, \pi_1, \pi_2, \ldots, \pi_T\}$

● Different from Value Iteration, which outputs values

Procedure:

1. Start with a random guess $\pi_0$ (can be deterministic or stochastic)

2. For $t = 0, \ldots, T$:
a. Do policy evaluation and compute $Q^{\pi_t}(s,a)$ for all $s,a$, using $Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[V^\pi(s')]$
b. Do policy improvement as $\pi_{t+1}(s) = \arg\max_a Q^{\pi_t}(s,a)$ for all $s$

This algorithm only makes progress: the performance of the policy improves monotonically
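A minimal tabular sketch of this procedure is given below, assuming a known MDP stored as arrays. The names `policy_evaluation` and `policy_iteration`, the random 5-state MDP, and the stopping rule (stop when the greedy policy no longer changes) are assumptions for illustration only.

```python
import numpy as np

def policy_evaluation(P, r, gamma, pi):
    """Exact Q^pi for a deterministic policy pi (one action index per state)."""
    S, _ = r.shape
    idx = np.arange(S)
    P_pi = P[idx, pi]               # (S, S) transition matrix under pi
    r_pi = r[idx, pi]               # (S,) reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ V        # Q^pi(s,a) = r(s,a) + gamma * E[V^pi(s')]

def policy_iteration(P, r, gamma, max_iters=100):
    S, _ = r.shape
    pi = np.zeros(S, dtype=int)     # arbitrary initial guess
    values = []                     # V^{pi_t} per iteration, to inspect monotonicity
    for _ in range(max_iters):
        Q = policy_evaluation(P, r, gamma, pi)
        values.append(Q[np.arange(S), pi].copy())
        new_pi = Q.argmax(axis=1)   # policy improvement
        if np.array_equal(new_pi, pi):
            break
        pi = new_pi
    return pi, Q, values

# Hypothetical random MDP with 5 states and 3 actions.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi_star, Q_star, values = policy_iteration(P, r, gamma)
```

The list `values` stores $V^{\pi_t}$ for each iteration, so the monotonic improvement claimed above can be checked state by state.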
State Visitation Probability

What’s the probability of visiting the state-action pair $(s,a)$ at time $t$ according to $\pi$ starting at $s_0$?

$\mathbb{P}_t^\pi(s,a;s_0) = \sum_{a_0,s_1,a_1,\ldots,s_{t-1},a_{t-1}} \mathbb{P}^\pi(s_0,a_0,\ldots,s_t=s,a_t=a)$

$d_{s_0}^\pi(s,a) = (1-\gamma)\sum_{h=0}^{\infty}\gamma^h\,\mathbb{P}_h^\pi(s,a;s_0)$

Note that $d_{s_0}^\pi$ is an infinite mixture
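For a small known MDP the discounted visitation distribution can be computed in two ways, which is a useful sanity check on the definition. The sketch below is an illustration under assumptions: the function names, the random 4-state MDP and the stochastic policy table `pi_probs` are invented here. It compares a truncated version of the infinite sum with the closed form obtained from the geometric series $\sum_h \gamma^h P_\pi^h = (I - \gamma P_\pi)^{-1}$.

```python
import numpy as np

def visitation_closed_form(P, pi_probs, gamma, s0):
    """d(s,a) via the geometric-series identity, for a policy table pi_probs (S, A)."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi_probs, P)   # state-to-state matrix under pi
    e0 = np.zeros(S)
    e0[s0] = 1.0
    d_state = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, e0)
    return d_state[:, None] * pi_probs            # d(s,a) = d(s) * pi(a|s)

def visitation_truncated_sum(P, pi_probs, gamma, s0, horizon=500):
    """d(s,a) via the defining sum, truncated after `horizon` terms."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi_probs, P)
    p_h = np.zeros(S)
    p_h[s0] = 1.0                                 # state distribution at time h
    d = np.zeros_like(pi_probs)
    for h in range(horizon):
        d += (1 - gamma) * gamma**h * p_h[:, None] * pi_probs
        p_h = p_h @ P_pi
    return d

# Hypothetical random MDP and stochastic policy.
rng = np.random.default_rng(0)
S, A, gamma, s0 = 4, 2, 0.9, 0
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
pi_probs = rng.random((S, A))
pi_probs /= pi_probs.sum(axis=1, keepdims=True)
d_closed = visitation_closed_form(P, pi_probs, gamma, s0)
d_sum = visitation_truncated_sum(P, pi_probs, gamma, s0)
```

The factor $(1-\gamma)$ is what makes the result a proper probability distribution over $(s,a)$ pairs.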

Approximate Policy Iteration

What if the state-space is large or continuous and we cannot do exact or iterative policy evaluation for all states?

Assumptions:

● The (infinite-horizon) MDP $(S, A, R, T, \gamma, \mu_0)$ is given: it is still known, but the state-space is too large to just enumerate all states and compute $V^\pi(s)$
● $Q$ is in $[0, 1/(1-\gamma)]$
Approximate Policy Iteration

● Outputs policies at every iteration: $\{\pi_0, \pi_1, \pi_2, \ldots, \pi_T\}$

Procedure:

1. Start with a random guess $\pi_0$

2. For $t = 0, \ldots, T$:
a. Do policy evaluation and compute $\hat{Q}^{\pi_t}(s,a)$ for all $s,a$, using $\hat{Q}^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[\hat{V}^\pi(s')]$
b. Do policy improvement as $\pi_{t+1}(s) = \arg\max_a \hat{Q}^{\pi_t}(s,a)$ for all $s$ (the argmax is still doable: we can still enumerate actions or discretize them)
Approximate Policy Evaluation

We build an approximation $\hat{V}^\pi$ of the true value function $V^\pi$

If the approximation is close to the true value, then the policy we derive from it will be close to optimal

Error bounds exist, but we will skip them for your happiness :)

Approximation for large state-spaces is needed to generalize among states and avoid looking at the whole $S$

We use a function approximator, e.g., linear approximators, neural nets, non-parametric, etc.
Approximate Policy Evaluation

To be fair, we can directly approximate $Q$, so let’s do that

Note that this also means that we can get rid of the assumption of knowing the MDP (since the transition is not needed)

THIS IS WHAT WE ARE GONNA DO FROM NOW ON: MDP IS UNKNOWN!!!

What do we need?
Data and Least Squares Regression

To be fair, we can directly approximate $Q$, so let’s do that

What do we need?

DATA $D = \{(s_i, a_i, y_i)\}_{i=1}^N$ with $y$ being our label!

With those we can then use least-squares regression to extract a function $\hat{Q}$ from a family of functions $\mathcal{Q} = \{Q: S \times A \to [0, 1/(1-\gamma)]\}$:

$\hat{Q} = \arg\min_{Q \in \mathcal{Q}} \sum_{i=1}^{N} (Q(s_i, a_i) - y_i)^2$

This is just a regression problem, which

● is numerically tractable
● has generalization bounds
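As a rough sketch of this regression step, the example below fits a linear-in-features $Q$ with ordinary least squares. Everything specific here is an assumption for illustration: the one-hot feature map, the synthetic targets `true_q`, and the noise level. With one-hot features the least-squares solution simply averages the labels seen for each $(s,a)$ pair; richer feature maps are what allow generalization across states.

```python
import numpy as np

def fit_q_least_squares(features, y):
    """Return weights w minimizing sum_i (features[i] @ w - y[i])**2."""
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return w

def one_hot_features(s, a, n_states, n_actions):
    """Hypothetical feature map: one indicator per (s, a) pair."""
    phi = np.zeros((len(s), n_states * n_actions))
    phi[np.arange(len(s)), s * n_actions + a] = 1.0
    return phi

# Synthetic dataset D = {(s_i, a_i, y_i)} with noisy labels around made-up Q-values.
rng = np.random.default_rng(0)
n_states, n_actions, N = 3, 2, 6000
true_q = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
s = rng.integers(n_states, size=N)
a = rng.integers(n_actions, size=N)
y = true_q[s, a] + rng.uniform(-0.5, 0.5, size=N)   # zero-mean, bounded noise

w = fit_q_least_squares(one_hot_features(s, a, n_states, n_actions), y)
q_hat = w.reshape(n_states, n_actions)
```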
Supervised Learning Digression
Supervised Learning
Supervised Learning: Regression

Given a data distribution $D$ from which we sample points $x_i$ and labels $y_i = f(x_i) + \epsilon_i$, with $\mathbb{E}[\epsilon_i] = 0$ and $|\epsilon_i| \leq c$, we want to approximate $f$ using a finite set of data (dataset):

$\hat{f} = \arg\min_{\hat{f} \in F} \sum_{i=1}^{N} (\hat{f}(x_i) - y_i)^2$ (Empirical Risk Minimizer)

with $F = \{\hat{f}: X \to \mathbb{R}\}$

We can generalize under the same data distribution

$\mathbb{E}_{x \sim D}(\hat{f}(x) - f(x))^2 \leq \delta$ with $\delta$ small

Supervised Learning: Distribution Mismatch

If $D' \neq D$, then $\mathbb{E}_{x \sim D'}(\hat{f}(x) - f(x))^2$ can be huge!
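A small numerical experiment makes the mismatch tangible. The setup is entirely hypothetical: the target function is a sine, the function class is straight lines, training inputs come from Uniform[0, 1], and the shifted test distribution is Uniform[4, 5].

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin  # hypothetical ground-truth function

# Training data from D = Uniform[0, 1], with bounded zero-mean noise.
x_train = rng.uniform(0.0, 1.0, size=500)
y_train = f(x_train) + rng.uniform(-0.05, 0.05, size=500)

# Empirical risk minimizer over the class F = {x -> w0 + w1 * x}.
X = np.stack([np.ones_like(x_train), x_train], axis=1)
w, *_ = np.linalg.lstsq(X, y_train, rcond=None)

def f_hat(x):
    return w[0] + w[1] * x

def mse(lo, hi, n=10_000):
    """Monte Carlo estimate of E[(f_hat(x) - f(x))^2] for x ~ Uniform[lo, hi]."""
    x = rng.uniform(lo, hi, size=n)
    return np.mean((f_hat(x) - f(x)) ** 2)

err_same = mse(0.0, 1.0)      # test points from the training distribution D
err_shifted = mse(4.0, 5.0)   # test points from a different distribution D'
```

The fitted line is accurate where the training data lives and far off elsewhere, which is exactly the failure mode that matters once the policy, and therefore the visited states, changes.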
End Supervised Learning Digression
Oscillation from Distribution Change

We cannot guarantee monotonic improvement anymore!

Our estimate is only good under $d_{\mu_0}^\pi$ (the distribution the data came from), and to make sure we have monotonic improvement we need a strong coverage assumption
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our $(s,a) \sim d_{s_0}^\pi(s,a) = (1-\gamma)\sum_{h=0}^{\infty}\gamma^h\,\mathbb{P}_h^\pi(s,a;s_0)$

Sampling From Mixtures

$p = (1-\alpha)\,p_1 + \alpha\,p_2$

● Flip a coin with probabilities $[1-\alpha, \alpha]$

● Commit to a specific $p_i$ based on that and sample from $p_i$
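The two bullets above amount to the following sketch. The component distributions (two unit-variance Gaussians) and the helper name `sample_mixture` are assumptions picked for the example.

```python
import numpy as np

def sample_mixture(rng, alpha, sample_p1, sample_p2):
    """Draw one sample from p = (1 - alpha) * p1 + alpha * p2."""
    if rng.random() < alpha:   # coin lands on p2 with probability alpha
        return sample_p2(rng)
    return sample_p1(rng)      # otherwise commit to p1

rng = np.random.default_rng(0)
alpha = 0.3
# Hypothetical components: Gaussians centred at 0 (p1) and at 10 (p2).
samples = np.array([
    sample_mixture(rng, alpha,
                   lambda g: g.normal(0.0, 1.0),
                   lambda g: g.normal(10.0, 1.0))
    for _ in range(20_000)
])
frac_from_p2 = np.mean(samples > 5.0)   # the components barely overlap
```

The fraction of samples landing near 10 should be close to `alpha`, confirming that committing to one component per draw reproduces the mixture weights.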
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our $(s,a) \sim d_{s_0}^\pi(s,a) = (1-\gamma)\sum_{h=0}^{\infty}\gamma^h\,\mathbb{P}_h^\pi(s,a;s_0)$

● Sample $h$ with probability $\gamma^h(1-\gamma)$, thus committing to a specific $\mathbb{P}_h^\pi(s,a;s_0)$
● Follow $\pi$ for $h$ timesteps starting from $s_0 \sim \mu_0$ and get $s_h, a_h$
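The roll-in step can be sketched as follows for a toy tabular environment. The function name `roll_in`, the random 3-state MDP and the deterministic policy array `pi` are assumptions for illustration; in the unknown-MDP setting the transition sampling would be a call to the environment instead of a lookup in `P`.

```python
import numpy as np

def roll_in(rng, P, pi, mu0, gamma):
    """Sample (s, a) from the discounted visitation distribution of pi.

    Draw h with probability (1 - gamma) * gamma**h, then follow pi for h steps.
    P: (S, A, S) transitions, pi: (S,) deterministic actions, mu0: (S,) initial distribution.
    """
    S = P.shape[0]
    h = rng.geometric(1.0 - gamma) - 1   # NumPy's geometric starts at 1; shift so h starts at 0
    s = rng.choice(S, p=mu0)
    for _ in range(h):
        s = rng.choice(S, p=P[s, pi[s]])
    return s, pi[s]

# Hypothetical toy MDP and policy.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.8
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
pi = np.array([0, 1, 0])
mu0 = np.array([1.0, 0.0, 0.0])

samples = [roll_in(rng, P, pi, mu0, gamma) for _ in range(5000)]
state_freq = np.bincount([s for s, _ in samples], minlength=S) / len(samples)
```

Here `state_freq` is the empirical state marginal of the sampled pairs, which can be compared against the closed-form discounted visitation when the toy MDP is known.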
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given $(s,a)$, how do we estimate $Q^\pi(s,a)$?

$Q^\pi(s_t,a_t) = \mathbb{E}\left[\sum_{h=0}^{\infty}\gamma^h r_h \mid (s_0,a_0) = (s_t,a_t),\ a_{h+1} = \pi(s_{h+1}),\ s_{h+1} \sim p(\cdot \mid s_h,a_h)\right]$

$Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s' \sim p(\cdot \mid s,a)}[V^\pi(s')]$

How do we get an unbiased estimate of this?

Sample many times and average!

Easier said than done :( Infinite horizon!

Use $\gamma$ as a sampling factor again to choose a horizon
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given $(s,a)$, how do we estimate $Q^\pi(s,a)$?
● Start at $(s,a)$
● Repeat:
○ Get $r(s,a)$
○ With probability $1-\gamma$ terminate (this check already applies at the first step, $h=0$) and return $y = \sum_h r_h$, the plain sum of the rewards collected so far: reaching step $h$ has probability $\gamma^h$, so the random termination already accounts for the discount
○ Otherwise execute the action, get in $s' \sim p(\cdot \mid s,a)$, pick $a' = \pi(s')$ and continue from $(s', a')$
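The roll-out estimator can be sketched and checked on a toy problem as follows. The function name `roll_out`, the random 3-state MDP and the number of roll-outs are assumptions made here. Because termination with probability $1-\gamma$ means step $h$ is reached with probability $\gamma^h$, the sketch accumulates the collected rewards without an extra discount factor, and the average over many roll-outs is compared with the exact value (computable only because this toy MDP is known).

```python
import numpy as np

def roll_out(rng, P, r, pi, gamma, s, a):
    """One sample of Q^pi(s, a) obtained with random termination."""
    y = 0.0
    while True:
        y += r[s, a]                            # collect r(s, a)
        if rng.random() < 1.0 - gamma:          # terminate with probability 1 - gamma
            return y
        s = rng.choice(P.shape[0], p=P[s, a])   # otherwise move to s' ~ p(.|s, a)
        a = pi[s]                               # and keep following pi

# Hypothetical toy MDP with rewards in [0, 1], so Q lies in [0, 1/(1 - gamma)].
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.8
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = np.array([0, 1, 0])
s0, a0 = 0, 1

# Sample many times and average.
y_mean = np.mean([roll_out(rng, P, r, pi, gamma, s0, a0) for _ in range(10_000)])

# Exact Q^pi(s0, a0) for comparison.
idx = np.arange(S)
V = np.linalg.solve(np.eye(S) - gamma * P[idx, pi], r[idx, pi])
q_exact = r[s0, a0] + gamma * P[s0, a0] @ V
```

Each individual `y` is a noisy but unbiased label, which is what the least-squares regression over $D = \{(s_i, a_i, y_i)\}$ needs.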

