Previous Lectures
• Supervised learning
– classification, regression
• Unsupervised learning
– clustering
• Reinforcement learning
– more general than supervised/unsupervised learning
– learn from interaction w/ environment to achieve a goal
environment
reward action
new state
agent
Slides from Peter Bodík RAD Lab, UC Berkeley
Recall
• examples
• defining an RL problem
– Markov Decision Processes
• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning
– Policy Gradient Methods
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
+1
UP
-1 80% move UP
10% move LEFT
10% move RIGHT
START
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• what’s the strategy to achieve max reward?
• what if the actions were deterministic?
Other examples
• pole-balancing
• TD-Gammon [Gerry Tesauro]
• helicopter [Andrew Ng]
• no teacher who would say “good” or “bad”
– is reward “10” good or bad?
– rewards could be delayed
• similar to control theory
– more general, fewer constraints
• explore the environment and learn from experience
– not just blind search, try to be smart about it
Robot in a room
actions: UP, DOWN, LEFT, RIGHT
+1
UP
-1 80% move UP
10% move LEFT
10% move RIGHT
START
reward +1 at [4,3], -1 at [4,2]
reward -0.04 for each step
• states
• actions
• rewards
• what is the solution?
Is this a solution?
+1
-1
• only if actions deterministic
– not in this case (actions are stochastic)
• solution/policy
– mapping from each state to an action
Optimal policy
+1
-1
Reward for each step: -2
+1
-1
Reward for each step: -0.1
+1
-1
Reward for each step: -0.04
+1
-1
Reward for each step: -0.01
+1
-1
Reward for each step: +0.01
+1
-1
Markov Decision Process (MDP)
• set of states S, set of actions A, initial state S0
• transition model P(s,a,s’) environment
– P( [1,1], up, [1,2] ) = 0.8 reward action
new state
• reward function r(s) agent
– r( [4,3] ) = +1
• goal: maximize cumulative reward in the long run
• policy: mapping from S to A
– (s) or (s,a) (deterministic vs. stochastic)
• reinforcement learning
– transitions and rewards usually not available
– how to change the policy based on experience
– how to explore the environment
Computing return from rewards
• episodic (vs. continuing) tasks
– “game over” after N steps
– optimal policy depends on N; harder to analyze
• additive rewards
– V(s0, s1, …) = r(s0) + r(s1) + r(s2) + …
– infinite value for continuing tasks
• discounted rewards
– V(s0, s1, …) = r(s0) + γ*r(s1) + γ2*r(s2) + …
– value bounded if rewards bounded
Value functions
• state value function: V(s)
– expected return when starting in s and following
• state-action value function: Q(s,a)
– expected return when starting in s, performing a, and following
s
• useful for finding the optimal policy
– can estimate from experience a
– pick the best action using Q(s,a) r
s’
• Bellman equation
Optimal value functions
• there’s a set of optimal policies Backup Diagrams:
– V defines partial ordering on policies See Textbook from
Richard S. Sutton
– they share the same optimal value function
Chapter 3: Finite
Markov Decision
Processes
• Bellman optimality equation Page 48
– system of n non-linear equations a
– solve for V*(s) r
– easy to extract the optimal policy
s’
• having Q*(s,a) makes it even simpler
Dynamic programming
• main idea
– use value functions to structure the search for good policies
– need a perfect model of the environment
• two main components
– policy evaluation: compute V from
– policy improvement: improve based on V
– start with an arbitrary policy
– repeat evaluation/improvement until convergence
Policy evaluation/improvement
• policy evaluation: -> V
– Bellman eqn’s define a system of n eqn’s
– could solve, but will use iterative version
– start with an arbitrary value function V0, iterate until Vk converges
• policy improvement: V -> ’
– ’ either strictly better than , or ’ is optimal (if = ’)
Policy/Value iteration
• Policy iteration
– two nested iterations; too slow
– don’t need to converge to Vk
• just move towards it
• Value iteration
– use Bellman optimality equation as an update
– converges to V*
Using DP
• need complete model of the environment and rewards
– robot in a room
• state space, action space, transition model
• can we use DP to solve
– robot in a room?
– back gammon?
– helicopter?
Self Study Example
Iterative policy evaluation:
See Textbook from Richard S. Sutton
CHAPTER 4. DYNAMIC PROGRAMMING
Exercise 4.1, 4.2
Page 61-67
Generalized Policy Iteration (GPI)
Basic Idea:
• GPI allows policy evaluation and policy improvement to interact in a less rigid manner than traditional
policy iteration.
• Unlike strict policy iteration where evaluation and improvement alternate in distinct phases, GPI allows
for more flexible and interleaved interactions between these processes.
Outline
• examples
• defining an RL problem
– Markov Decision Processes
• solving an RL problem
– Dynamic Programming
– Monte Carlo methods
– Temporal-Difference learning