0% found this document useful (0 votes)
8 views66 pages

2025 - 07 - Approximate Policy Iteration

The document discusses Approximate Policy Iteration in reinforcement learning, focusing on the challenges of large or continuous state spaces where exact policy evaluation is infeasible. It outlines the procedure for policy iteration, including policy evaluation and improvement, and emphasizes the need for function approximation to generalize across states. Additionally, it touches on data generation methods for sampling state-action pairs to support the learning process.

Uploaded by

Ali
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views66 pages

2025 - 07 - Approximate Policy Iteration

The document discusses Approximate Policy Iteration in reinforcement learning, focusing on the challenges of large or continuous state spaces where exact policy evaluation is infeasible. It outlines the procedure for policy iteration, including policy evaluation and improvement, and emphasizes the need for function approximation to generalize across states. Additionally, it touches on data generation methods for sampling state-action pairs to support the learning process.

Uploaded by

Ali
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Approximate Policy Iteration

Reinforcement Learning

Roberto Capobianco
Recap
Finite-Horizon MDPs

Slightly different formulation:


(S, A, R, T, H, 𝜇 0)
(time-horizon)
H ≥ 0 and s0 ~ 𝜇 0 (initial state distribution)

We consider time-dependent policies 𝜋


𝜋 = {𝜋0, 𝜋1, 𝜋2...𝜋H-1}
Actions might be different for the same state depending on t
Very common in control!
𝜋

Finite-Horizon MDP: V & Q


s

Vh𝜋(s) = 𝔼𝜋[⅀ⲧ=hH-1r(sⲧ,aⲧ)]

where sh=s, aⲧ=𝜋ⲧ(sⲧ) and sⲧ+1~P(.|sⲧ,aⲧ)

Qh𝜋(s,a) = 𝔼𝜋[⅀ⲧ=hH-1r(sⲧ,aⲧ)]

where sh=s, ah=a, aⲧ=𝜋ⲧ(sⲧ) and sⲧ+1~P(.|sⲧ,aⲧ)


𝜋

a
s s’
Finite-Horizon MDP: Bellman Equation

Qh𝜋(s,a) = r(s,a)+𝔼s’~p(.|s,a)[Vh+1𝜋(s’)]

𝜋
h h+1
a
𝜋
Qh s s’
Finding the Optimal Policy

𝜋* = {𝜋0*, 𝜋1*, 𝜋2*...𝜋H-1*}

Let’s reason backwards in time and apply dynamic


programming:

QH-1*(s,a) = r(s,a)

𝜋H-1*(s)=argmaxaQH-1*(s,a)

VH-1*(s)=maxaQH-1*(s,a)=QH-1*(s,𝜋H-1*(s))
Control Problems

So far, we assumed discrete state and action spaces, but


what about cartpole:

● state: angular pos & vel, linear


pos & vel
● action/control: force applied on
the cart
● goal: find the control policy
which minimizes the long term
cost c
Optimal Control

Given a dynamical system with a non-linear transition


function f, state x in ℝd and control u in ℝk, we want to find
a control policy 𝜋 such that

minimize 𝔼𝜋[cH(xH) + ⅀h=0H-1ch(xh,uh)]

where uh=𝜋(xh) and x0 ~ 𝜇0

Now this seems very familiar! Can we treat it as a


Finite-Horizon MDP
Bellman’s Curse of Dimensionality

● n-dimensional (discrete) state space


● The number of states grows exponentially in n

In practice discretization is useful, but it is only


computationally feasible up to 5 or 6 dimensional state
spaces
Let’s try to work directly in continuous space, starting
from simplified problems
Linear Systems

Consider a system of this kind:


This is our
xt+1=Axt+But
transition function!
● xt state at time t
● ut control (i.e., action) at time t

A in ℝdxd, B in ℝdxk
Quadratic Cost Function

Consider a cost function of this kind

c(xt,ut)=xtTQxt+utTRut alternative notation


g(xt,ut)
dxd kxk
● Q in ℝ and R in ℝ square matrices
● Q and R positive definite

As a result, there is a non-zero cost for any non-zero state


with all-zero control
LQR algorithm

● Initialize PH (at 0 or Q)
● Starting from h = H-1, backwards
○ Set Kh* = -(R+BTPh+1B)-1BTPh+1A
○ Compute u = 𝜋h*(xt) = Kh*xt
○ Set Ph = (Q + Kh*TRKh* + (A + BKh*)TPh+1(A + BKh*)) Riccati
○ Set Jh*= xtTPhxt Equation

This is the Value Iteration update done in closed form: it is


always the same and solves this particular continuous-state
system with a quadratic cost
LQR Extensions

Extensions to the LQR make it more generally applicable to:


● Affine systems
● Systems with stochasticity
● Regulation around non-zero fixed point for non-linear systems
● Trajectory following for non-linear systems
● ...
End Recap
Policy Iteration

● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}


● Different from Value Iteration that was outputting values

Procedure:

1. Start with a random guess 𝜋0 (can be deterministic or stochastic)


2. For t=0,...,T: Q𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V𝜋(s’)]
a. Do policy evaluation and compute Q𝜋t for all s,a
b. Do policy improvement as 𝜋t+1=argmaxaQ𝜋t(s,a) for all s

This algorithm only makes progress, and the performance progress of the
policy is monotonic
State Visitation Probability

What’s the probability of visiting state s, a at time t


according to 𝜋 starting at s0?

d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
ℙ𝜋t(s,a;s0)=⅀a0,s1,a1,...st-1,at-1ℙ𝜋(s0,a0,...st=s,at=a)

ℙ𝜋(s0,a0,...st,at)=𝜋(a0|s0)p(s1|s0,a0)𝜋(a1|s1)p(s2|s1,a1)...p(st|st-1,at-1)𝜋(at|st)
State Visitation Probability

What’s the probability of visiting state s, a at time t


according to 𝜋 starting at s0?
𝜋 Note that d𝜋s0 is an infinite
d s0
(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0) mixture
ℙ𝜋t(s,a;s0)=⅀a0,s1,a1,...st-1,at-1ℙ𝜋(s0,a0,...st=s,at=a)

ℙ𝜋(s0,a0,...st,at)=𝜋(a0|s0)p(s1|s0,a0)𝜋(a1|s1)p(s2|s1,a1)...p(st|st-1,at-1)𝜋(at|st)
Approximate Policy Iteration

What if the state-space is large and we cannot do exact or


iterative policy evaluation for all states?
Approximate Policy Iteration

What if the state-space is large and we cannot do exact or


iterative policy evaluation for all states?
Approximate Policy Iteration

What if the state-space is large or continuous and we cannot


do exact or iterative policy evaluation for all states?
Approximate Policy Iteration

What if the state-space is large and we cannot do exact or


iterative policy evaluation for all states?

Assumptions: the (infinite-horizon) MDP is still known, but


the state-space is too large to just enumerate all states
and compute V𝜋(s)
Approximate Policy Iteration

What if the state-space is large and we cannot do exact or


iterative policy evaluation for all states?

Assumptions:

(S, A, R, T, γ, 𝜇 0) is given

Q is in [0, 1/(1-γ)]
Approximate Policy Iteration

● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}

Procedure:

1. Start with a random guess 𝜋0


2. For t=0,...,T:
a. Do policy evaluation and compute Q^𝜋t for all s,a

Q^𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V^𝜋(s’)]
Approximate Policy Iteration

● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}


Procedure:

1. Start with a random guess 𝜋0


2. For t=0,...,T:
a. Do policy evaluation and compute Q^𝜋t for all s,a
Q^𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V^𝜋(s’)]
b. Do policy improvement as 𝜋t+1=argmaxaQ^𝜋t(s,a) for all s
argmax is still doable, we can still enumerate actions or
discretize them
Approximate Policy Evaluation

We build an approximation V^𝜋 of the true value function V𝜋

If the approximation is close to the true value, then the


optimal policy will be close-to-optimal
Approximate Policy Evaluation

We build an approximation V^𝜋 of the true value function V𝜋

If the approximation is close to the true value, then the


optimal policy will be close-to-optimal

Error bounds exist, but we will skip them for your happiness
:)
Approximate Policy Evaluation

We build an approximation V^𝜋 of the true value function V𝜋

If the approximation is close to the true value, then the


optimal policy will be close-to-optimal

Approximation for large state-spaces is needed to generalize


among states and avoid looking at the whole S
Approximate Policy Evaluation

We build an approximation V^𝜋 of the true value function V𝜋

If the approximation is close to the true value, then the


optimal policy will be close-to-optimal

Approximation for large state-spaces is needed to generalize


among states and avoid looking at the whole S

We use a function approximator


Approximate Policy Evaluation

We build an approximation V^𝜋 of the true value function V𝜋


If the approximation is close to the true value, then the optimal
policy will be close-to-optimal

Approximation for large state-spaces is needed to generalize


among states and avoid looking at the whole S
We use a function approximator
e.g., linear approximators, neural nets, non-parametric, etc.
Approximate Policy Evaluation

To be fair, we can directly approximate Q, so let’s do that


Approximate Policy Evaluation

To be fair, we can directly approximate Q, so let’s do that

Note that this also means that we can also get rid of the
assumption of knowing the MDP (since the transition is not
needed)
Approximate Policy Evaluation

To be fair, we can directly approximate Q, so let’s do that

Note that this also means that we can also get rid of the
assumption of knowing the MDP (since the transition is not
needed)

THIS IS WHAT WE ARE GONNA DO FROM NOW ON: MDP IS UNKNOWN!!!


Approximate Policy Evaluation

To be fair, we can directly approximate Q, so let’s do that

What do we need?
Data and Least Square Regression

To be fair, we can directly approximate Q, so let’s do that


What do we need?
DATA D = {si,ai,yi}i=1N with y being our label!
with those we can then use least-square regression to
extract a function Q in the family of functions
SxA -> [0, 1/(1-γ)]
Data and Least Square Regression

To be fair, we can directly approximate Q, so let’s do that


What do we need?
DATA D = {si,ai,yi}i=1N with y being our label!
with those we can then use least-square regression to extract a
function Q in the family of functions
Q: SxA -> [0, 1/(1-γ)]
argminQ ⅀ N(Q(si,ai)-yi)2
in Q i=1
Data and Least Square Regression

argminQ ⅀
in Q i=1
N
(Q(si,ai)-yi)2

This is just a regression problem, which

● is numerically tractable
● has generalization bounds
Supervised Learning Digression
Supervised Learning
Supervised Learning: Regression

Given a data distribution D from which we sample points xi


and labels yi=f(xi)+ϵi, with 𝔼[ϵi]=0 and |ϵi|≤c, we want to
approximate f using a finite set of data (dataset):

f^ = argminf^ in

F i=1
N
(f^(xi
)-yi
) 2

with F={f^: X->ℝ}


Supervised Learning: Regression

Given a data distribution D from which we sample points xi


and labels yi=f(xi)+ϵi, with 𝔼[ϵi]=0 and |ϵi|≤c, we want to
approximate f using a finite set of data (dataset):
Empirical f^ = argminf^ ⅀ N
(f^(x )-y ) 2
in F i=1 i i
Risk
Minimizer with F={f^: X->ℝ}

We can generalize under the same data distribution

𝔼x~D(f^(x)-f(x))2≤δ with δ small


Supervised Learning: Distribution Mismatch

𝔼x~D’(f^(x)-f(x))2 can be huge!

If D’≠D
End
Supervised Learning Digression
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change

We cannot guarantee anymore


monotonic improvement!
Oscillation from Distribution Change

Our estimation is only good under d𝜇0𝜋 and to make sure we


have monotonic improvement we need a strong coverage assumption
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)


Sampling From Mixtures

p1

2
p
p = (1-α)p1 + αp2

● Flip a coin with probability [α, 1-α]


● Commit to a specific pi based on that and sample from pi
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)

● Sample h from γh(1-γ), thus committing to a specific


ℙ𝜋h(s,a;s0)
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)

● Sample h from γh(1-γ), thus committing to a specific


ℙ𝜋h(s,a;s0)
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)

● Sample h from γh(1-γ), thus committing to a specific


ℙ𝜋h(s,a;s0)
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
● Sample h from γh(1-γ), thus committing to a specific
ℙ𝜋h(s,a;s0)
● Follow 𝜋 for h timesteps starting from s0 ~ 𝜇0 and get
sh, ah
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?


Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

Q𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V𝜋
(s’)]
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

How do we get an unbiased


estimate of this?
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

Sample many times and average!


Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

𝜋
Sample many times and average!

a
s s’
s’
s’
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀ ∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

Easier said than done :(

Infinite horizon!
Data Generation

2 steps:

1. Roll-in
2. Roll-out & compute supervision targets

Given s, a, how do we estimate Q𝜋(s,a)?

Q𝜋(st,at) = 𝔼[⅀ ∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]

Use γ as a sampling factor


again to choose an horizon
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
D = {si,ai,yi}i=1N
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
why?
Data Generation

2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given (s,a), how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
Probability of termination
at next state (h=0)

You might also like