Approximate Policy Iteration
Reinforcement Learning
Roberto Capobianco
Recap
Finite-Horizon MDPs
Slightly different formulation:
(S, A, R, T, H, 𝜇 0)
(time-horizon)
H ≥ 0 and s0 ~ 𝜇 0 (initial state distribution)
We consider time-dependent policies 𝜋
𝜋 = {𝜋0, 𝜋1, 𝜋2...𝜋H-1}
Actions might be different for the same state depending on t
Very common in control!
𝜋
Finite-Horizon MDP: V & Q
s
Vh𝜋(s) = 𝔼𝜋[⅀ⲧ=hH-1r(sⲧ,aⲧ)]
where sh=s, aⲧ=𝜋ⲧ(sⲧ) and sⲧ+1~P(.|sⲧ,aⲧ)
Qh𝜋(s,a) = 𝔼𝜋[⅀ⲧ=hH-1r(sⲧ,aⲧ)]
where sh=s, ah=a, aⲧ=𝜋ⲧ(sⲧ) and sⲧ+1~P(.|sⲧ,aⲧ)
𝜋
a
s s’
Finite-Horizon MDP: Bellman Equation
Qh𝜋(s,a) = r(s,a)+𝔼s’~p(.|s,a)[Vh+1𝜋(s’)]
𝜋
h h+1
a
𝜋
Qh s s’
Finding the Optimal Policy
𝜋* = {𝜋0*, 𝜋1*, 𝜋2*...𝜋H-1*}
Let’s reason backwards in time and apply dynamic
programming:
QH-1*(s,a) = r(s,a)
𝜋H-1*(s)=argmaxaQH-1*(s,a)
VH-1*(s)=maxaQH-1*(s,a)=QH-1*(s,𝜋H-1*(s))
Control Problems
So far, we assumed discrete state and action spaces, but
what about cartpole:
● state: angular pos & vel, linear
pos & vel
● action/control: force applied on
the cart
● goal: find the control policy
which minimizes the long term
cost c
Optimal Control
Given a dynamical system with a non-linear transition
function f, state x in ℝd and control u in ℝk, we want to find
a control policy 𝜋 such that
minimize 𝔼𝜋[cH(xH) + ⅀h=0H-1ch(xh,uh)]
where uh=𝜋(xh) and x0 ~ 𝜇0
Now this seems very familiar! Can we treat it as a
Finite-Horizon MDP
Bellman’s Curse of Dimensionality
● n-dimensional (discrete) state space
● The number of states grows exponentially in n
In practice discretization is useful, but it is only
computationally feasible up to 5 or 6 dimensional state
spaces
Let’s try to work directly in continuous space, starting
from simplified problems
Linear Systems
Consider a system of this kind:
This is our
xt+1=Axt+But
transition function!
● xt state at time t
● ut control (i.e., action) at time t
A in ℝdxd, B in ℝdxk
Quadratic Cost Function
Consider a cost function of this kind
c(xt,ut)=xtTQxt+utTRut alternative notation
g(xt,ut)
dxd kxk
● Q in ℝ and R in ℝ square matrices
● Q and R positive definite
As a result, there is a non-zero cost for any non-zero state
with all-zero control
LQR algorithm
● Initialize PH (at 0 or Q)
● Starting from h = H-1, backwards
○ Set Kh* = -(R+BTPh+1B)-1BTPh+1A
○ Compute u = 𝜋h*(xt) = Kh*xt
○ Set Ph = (Q + Kh*TRKh* + (A + BKh*)TPh+1(A + BKh*)) Riccati
○ Set Jh*= xtTPhxt Equation
This is the Value Iteration update done in closed form: it is
always the same and solves this particular continuous-state
system with a quadratic cost
LQR Extensions
Extensions to the LQR make it more generally applicable to:
● Affine systems
● Systems with stochasticity
● Regulation around non-zero fixed point for non-linear systems
● Trajectory following for non-linear systems
● ...
End Recap
Policy Iteration
● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}
● Different from Value Iteration that was outputting values
Procedure:
1. Start with a random guess 𝜋0 (can be deterministic or stochastic)
2. For t=0,...,T: Q𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V𝜋(s’)]
a. Do policy evaluation and compute Q𝜋t for all s,a
b. Do policy improvement as 𝜋t+1=argmaxaQ𝜋t(s,a) for all s
This algorithm only makes progress, and the performance progress of the
policy is monotonic
State Visitation Probability
What’s the probability of visiting state s, a at time t
according to 𝜋 starting at s0?
d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
ℙ𝜋t(s,a;s0)=⅀a0,s1,a1,...st-1,at-1ℙ𝜋(s0,a0,...st=s,at=a)
ℙ𝜋(s0,a0,...st,at)=𝜋(a0|s0)p(s1|s0,a0)𝜋(a1|s1)p(s2|s1,a1)...p(st|st-1,at-1)𝜋(at|st)
State Visitation Probability
What’s the probability of visiting state s, a at time t
according to 𝜋 starting at s0?
𝜋 Note that d𝜋s0 is an infinite
d s0
(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0) mixture
ℙ𝜋t(s,a;s0)=⅀a0,s1,a1,...st-1,at-1ℙ𝜋(s0,a0,...st=s,at=a)
ℙ𝜋(s0,a0,...st,at)=𝜋(a0|s0)p(s1|s0,a0)𝜋(a1|s1)p(s2|s1,a1)...p(st|st-1,at-1)𝜋(at|st)
Approximate Policy Iteration
What if the state-space is large and we cannot do exact or
iterative policy evaluation for all states?
Approximate Policy Iteration
What if the state-space is large and we cannot do exact or
iterative policy evaluation for all states?
Approximate Policy Iteration
What if the state-space is large or continuous and we cannot
do exact or iterative policy evaluation for all states?
Approximate Policy Iteration
What if the state-space is large and we cannot do exact or
iterative policy evaluation for all states?
Assumptions: the (infinite-horizon) MDP is still known, but
the state-space is too large to just enumerate all states
and compute V𝜋(s)
Approximate Policy Iteration
What if the state-space is large and we cannot do exact or
iterative policy evaluation for all states?
Assumptions:
(S, A, R, T, γ, 𝜇 0) is given
Q is in [0, 1/(1-γ)]
Approximate Policy Iteration
● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}
Procedure:
1. Start with a random guess 𝜋0
2. For t=0,...,T:
a. Do policy evaluation and compute Q^𝜋t for all s,a
Q^𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V^𝜋(s’)]
Approximate Policy Iteration
● Outputs policies at every iteration: {𝜋0, 𝜋1, 𝜋2...𝜋T}
Procedure:
1. Start with a random guess 𝜋0
2. For t=0,...,T:
a. Do policy evaluation and compute Q^𝜋t for all s,a
Q^𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V^𝜋(s’)]
b. Do policy improvement as 𝜋t+1=argmaxaQ^𝜋t(s,a) for all s
argmax is still doable, we can still enumerate actions or
discretize them
Approximate Policy Evaluation
We build an approximation V^𝜋 of the true value function V𝜋
If the approximation is close to the true value, then the
optimal policy will be close-to-optimal
Approximate Policy Evaluation
We build an approximation V^𝜋 of the true value function V𝜋
If the approximation is close to the true value, then the
optimal policy will be close-to-optimal
Error bounds exist, but we will skip them for your happiness
:)
Approximate Policy Evaluation
We build an approximation V^𝜋 of the true value function V𝜋
If the approximation is close to the true value, then the
optimal policy will be close-to-optimal
Approximation for large state-spaces is needed to generalize
among states and avoid looking at the whole S
Approximate Policy Evaluation
We build an approximation V^𝜋 of the true value function V𝜋
If the approximation is close to the true value, then the
optimal policy will be close-to-optimal
Approximation for large state-spaces is needed to generalize
among states and avoid looking at the whole S
We use a function approximator
Approximate Policy Evaluation
We build an approximation V^𝜋 of the true value function V𝜋
If the approximation is close to the true value, then the optimal
policy will be close-to-optimal
Approximation for large state-spaces is needed to generalize
among states and avoid looking at the whole S
We use a function approximator
e.g., linear approximators, neural nets, non-parametric, etc.
Approximate Policy Evaluation
To be fair, we can directly approximate Q, so let’s do that
Approximate Policy Evaluation
To be fair, we can directly approximate Q, so let’s do that
Note that this also means that we can also get rid of the
assumption of knowing the MDP (since the transition is not
needed)
Approximate Policy Evaluation
To be fair, we can directly approximate Q, so let’s do that
Note that this also means that we can also get rid of the
assumption of knowing the MDP (since the transition is not
needed)
THIS IS WHAT WE ARE GONNA DO FROM NOW ON: MDP IS UNKNOWN!!!
Approximate Policy Evaluation
To be fair, we can directly approximate Q, so let’s do that
What do we need?
Data and Least Square Regression
To be fair, we can directly approximate Q, so let’s do that
What do we need?
DATA D = {si,ai,yi}i=1N with y being our label!
with those we can then use least-square regression to
extract a function Q in the family of functions
SxA -> [0, 1/(1-γ)]
Data and Least Square Regression
To be fair, we can directly approximate Q, so let’s do that
What do we need?
DATA D = {si,ai,yi}i=1N with y being our label!
with those we can then use least-square regression to extract a
function Q in the family of functions
Q: SxA -> [0, 1/(1-γ)]
argminQ ⅀ N(Q(si,ai)-yi)2
in Q i=1
Data and Least Square Regression
argminQ ⅀
in Q i=1
N
(Q(si,ai)-yi)2
This is just a regression problem, which
● is numerically tractable
● has generalization bounds
Supervised Learning Digression
Supervised Learning
Supervised Learning: Regression
Given a data distribution D from which we sample points xi
and labels yi=f(xi)+ϵi, with 𝔼[ϵi]=0 and |ϵi|≤c, we want to
approximate f using a finite set of data (dataset):
f^ = argminf^ in
⅀
F i=1
N
(f^(xi
)-yi
) 2
with F={f^: X->ℝ}
Supervised Learning: Regression
Given a data distribution D from which we sample points xi
and labels yi=f(xi)+ϵi, with 𝔼[ϵi]=0 and |ϵi|≤c, we want to
approximate f using a finite set of data (dataset):
Empirical f^ = argminf^ ⅀ N
(f^(x )-y ) 2
in F i=1 i i
Risk
Minimizer with F={f^: X->ℝ}
We can generalize under the same data distribution
𝔼x~D(f^(x)-f(x))2≤δ with δ small
Supervised Learning: Distribution Mismatch
𝔼x~D’(f^(x)-f(x))2 can be huge!
If D’≠D
End
Supervised Learning Digression
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change
Oscillation from Distribution Change
We cannot guarantee anymore
monotonic improvement!
Oscillation from Distribution Change
Our estimation is only good under d𝜇0𝜋 and to make sure we
have monotonic improvement we need a strong coverage assumption
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
Sampling From Mixtures
p1
2
p
p = (1-α)p1 + αp2
● Flip a coin with probability [α, 1-α]
● Commit to a specific pi based on that and sample from pi
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
● Sample h from γh(1-γ), thus committing to a specific
ℙ𝜋h(s,a;s0)
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
● Sample h from γh(1-γ), thus committing to a specific
ℙ𝜋h(s,a;s0)
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
● Sample h from γh(1-γ), thus committing to a specific
ℙ𝜋h(s,a;s0)
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
We want to sample our (s,a)~ d𝜋s0(s,a) = (1-γ)⅀∞h=0γhℙ𝜋h(s,a;s0)
● Sample h from γh(1-γ), thus committing to a specific
ℙ𝜋h(s,a;s0)
● Follow 𝜋 for h timesteps starting from s0 ~ 𝜇0 and get
sh, ah
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
Q𝜋(st,a) = rt+γ𝔼s’~p(.|s,a)[V𝜋
(s’)]
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
How do we get an unbiased
estimate of this?
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
Sample many times and average!
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
𝜋
Sample many times and average!
a
s s’
s’
s’
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀ ∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
Easier said than done :(
Infinite horizon!
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
Q𝜋(st,at) = 𝔼[⅀ ∞h=0γhrh|(s0,a0)=(st,at),ah+1=𝜋(sh),sh+1∼p(.|sh,ah)]
Use γ as a sampling factor
again to choose an horizon
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
D = {si,ai,yi}i=1N
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given s, a, how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
why?
Data Generation
2 steps:
1. Roll-in
2. Roll-out & compute supervision targets
Given (s,a), how do we estimate Q𝜋(s,a)?
● Start at s,a
● Repeat:
○ Get r(s,a)
○ With probability 1-γ terminate and return y=⅀γhrh
○ Execute action and get in s’
Probability of termination
at next state (h=0)