Reinforcement Learning
These slides are compiled from various online sources, with grateful acknowledgment to all those who made the content available online.
Concepts
• What is Learning
• Machine Learning
• Types of Machine Learning
• Concept of Reinforcement Learning
• Q-Learning
• OpenAI Gym Environment
• Application of Reinforcement Learning to Self-driving Cabs
• Implementation of Self-driving Cabs using OpenAI Gym
Reinforcement Learning
• Learning to interact with an
environment
– Robots, games, process control
– With limited human training
– Where the ‘right thing’ isn’t obvious
Reinforcement Learning
• Reinforcement Learning:
– Goal:
Maximize $\sum_{i=1}^{\infty} \mathrm{Reward}(\mathrm{State}_i, \mathrm{Action}_i)$
– Data:
$\mathrm{Reward}_i, \mathrm{State}_{i+1} = \mathrm{Interact}(\mathrm{State}_i, \mathrm{Action}_i)$
• In interactive problems it is often impractical to obtain
examples of desired behavior that are both correct and
representative of all the situations in which the agent has to
act.
• An agent must therefore be able to learn from its own experience to
maximize the reward.
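The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the corridor environment, the `interact` function, and `run_episode` are hypothetical names chosen to mirror the Reward, State = Interact(State, Action) notation.

```python
def interact(state, action):
    """Hypothetical toy environment: a corridor with positions 0..4.

    action is -1 (left) or +1 (right). Arriving at position 4 pays 1.
    Returns (reward, next_state), mirroring
    Reward_i, State_{i+1} = Interact(State_i, Action_i).
    """
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 and state != 4 else 0.0
    return reward, next_state


def run_episode(policy, start_state=0, max_steps=20):
    """Roll out a policy and return the total reward it collected."""
    state, total = start_state, 0.0
    for _ in range(max_steps):
        action = policy(state)
        reward, state = interact(state, action)
        total += reward
        if state == 4:  # goal reached, episode ends
            break
    return total
```

The agent never sees labeled examples of the right action. It only sees the reward that comes back from each call to `interact`.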
Unsupervised vs. Reinforcement
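• Reinforcement learning tries to maximize a reward signal
instead of trying to find hidden structure.
• Exploration vs. exploitation
– The parameter $\epsilon$ controls this trade-off: how often the agent tries a new action rather than the best one known so far
• Cumulative reward
[Figure: 3×3 grid world with cells labeled 0,0 through 2,2 (column, row)]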
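One common way to let a parameter epsilon settle the exploration vs. exploitation decision is epsilon-greedy selection. The sketch below assumes that scheme; the slides do not spell out which one they use, and `epsilon_greedy` is an illustrative name.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability epsilon: explore (uniformly random action).
    Otherwise: exploit (action with the highest current estimate).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the agent always exploits, and with `epsilon=1` it always explores. Values in between mix the two.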
Elements of Reinforcement Learning Systems
• Reinforcement learning systems have 4 main elements:
– Policy
– Reward signal
– Value function
– Optional model of the environment
Policy
• A policy is a mapping from the perceived states of the environment to
actions to be taken in those states
• A reinforcement learning agent uses a policy to select actions given the
current environment state
Reward Signal
• The reward signal defines the goal
• On each time step, the environment sends a single number called the
reward to the reinforcement learning agent.
• The agent’s objective is to maximize the total reward that it receives
over the long run
• The reward signal is used to alter the policy
Value Function (1)
• The reward signal indicates what is good in the short run while the value
function indicates what is good in the long run
• The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting in that state
• Compute the value using the states that are likely to follow the current
state and the rewards available in those states
• Future rewards may be time-discounted with a factor in the interval [0,
1]
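Time-discounting can be made concrete with a short helper. This is a sketch under the usual convention that the reward received t steps in the future is weighted by the discount factor raised to the power t; `discounted_return` is a name chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t].

    Computed from the last reward backwards, so each step is
    g = r + gamma * g.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, a reward of 100 that arrives two steps from now is worth 25 today when the discount factor is 0.5.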
Value Function (2)
• Use the values to make and evaluate decisions
• Action choices are made based on value judgements
• Prefer actions that bring about states of highest value instead of highest
reward
• Rewards are given directly by the environment
• Values must continually be re-estimated from the sequence of
observations that an agent makes over its lifetime
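The difference between choosing by reward and choosing by value can be shown with a toy decision. The action names and numbers below are invented for illustration; each action maps to a pair of immediate reward and estimated value of the state it leads to.

```python
def pick_by_reward(options):
    """Choose the action with the highest immediate reward."""
    return max(options, key=lambda a: options[a][0])


def pick_by_value(options, gamma):
    """Choose the action with the highest reward plus discounted next-state value."""
    return max(options, key=lambda a: options[a][0] + gamma * options[a][1])


# Hypothetical choice: action -> (immediate reward, value of next state)
options = {"snack": (10.0, 0.0), "detour": (0.0, 100.0)}
```

Here `pick_by_reward(options)` selects "snack", while `pick_by_value(options, 0.5)` selects "detour", because 0 + 0.5 × 100 exceeds 10 + 0.5 × 0.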
Model-free versus Model-based
• A model of the environment allows inferences to be made about how
the environment will behave
• Example: Given a state and an action to be taken while in that state, the
model could predict the next state and the next reward
• Models are used for planning, which means deciding on a course of
action by considering possible future situations before they are
experienced
• Model-based methods use models and planning. Think of this as
modelling the dynamics p(s’ | s, a)
• Model-free methods learn exclusively from trial-and-error (i.e. no
modelling of the environment)
• This presentation focuses on model-free methods
Setup for Reinforcement Learning
Markov Decision Process (environment)
• Discrete-time stochastic control process
• Each time step, in state $s$:
– Agent chooses action $a$ from set $A_s$
– Moves to new state $s'$ with probability: $P_a(s, s')$
– Receives reward: $R_a(s, s')$
• Every outcome depends on $s$ and $a$
– Nothing depends on previous states/actions
Policy (agent’s behavior)
• $\pi(s)$ – The action to take in state $s$
• Goal: maximize $\sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1})$
– $a_t = \pi(s_t)$
– $0 \le \gamma < 1$ – Trade-off between immediate and future reward
• $V^\pi(s) = \sum_{s'} P_{\pi(s)}(s, s') \left( R_{\pi(s)}(s, s') + \gamma V^\pi(s') \right)$
– $P_{\pi(s)}(s, s')$: probability of moving to each state
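The recursive definition of the value of a policy can be evaluated numerically by applying it repeatedly until the numbers stop changing. The sketch below assumes a small hypothetical MDP stored as a dictionary; the state names, the `evaluate_policy` function, and the fixed number of sweeps are illustrative choices, not taken from the slides.

```python
def evaluate_policy(P, policy, gamma, sweeps=200):
    """Repeatedly apply V(s) = sum_s' P(s, s') * (R(s, s') + gamma * V(s')).

    P[s][a] is a list of (probability, next_state, reward) triples.
    policy[s] is the action the agent takes in state s.
    """
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
             for s in P}
    return V


# Hypothetical three-state MDP, invented for illustration.
# "end" is an absorbing state that pays nothing.
P = {
    "A": {"go": [(0.5, "B", 0.0), (0.5, "A", 0.0)]},
    "B": {"go": [(1.0, "end", 100.0)]},
    "end": {"stay": [(1.0, "end", 0.0)]},
}
policy = {"A": "go", "B": "go", "end": "stay"}
V = evaluate_policy(P, policy, gamma=0.5)
```

Note that this evaluation needs the transition probabilities, so it is a model-based computation. Q-learning, covered next, avoids that requirement.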
Reward:
100 at chest
0 for others
Approach:
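– Initialize estimate of discounted reward for every state/action pair: $Q(s, a) = 0$
– Repeat (for a while):
• Take a random action $a$ from $A_s$
• Receive $s'$ and $R_a(s, s')$ from environment
• Update $Q_t(s, a) = (1 - \alpha)\, Q_{t-1}(s, a) + \alpha \left[ R_a(s, s') + \gamma \max_{a'} Q_{t-1}(s', a') \right]$
• With $\alpha = 1$ the update reduces to $Q_t(s, a) = R_a(s, s') + \gamma \max_{a'} Q_{t-1}(s', a')$, the form used in the worked example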
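The update step can be sketched as a single function over a dictionary-based Q table. The function name `q_update`, the dictionary layout, and the convention that a terminal state has no next actions are assumptions made for this illustration.

```python
def q_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    """Apply one Q-learning update in place and return the new estimate.

    Q maps (state, action) pairs to estimates; missing entries count as 0.
    next_actions lists the actions available in s_next (empty if terminal).
    """
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in next_actions),
                    default=0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q[(s, a)]
```

A learning rate `alpha` below 1 blends the new target with the old estimate. That matters when rewards or transitions are noisy, whereas `alpha=1` simply overwrites the estimate.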
– $R_a(s, s') = 0$
• Update $Q(\langle 1,1 \rangle, \mathit{east}) = 0$
• Random action from $A_{\langle 2,1 \rangle}$ = north
• $s' = \langle 2,0 \rangle$
• $R_a(s, s') = 100$
$Q(s, a) = R_a(s, s') + \gamma \max_{a'} Q_{n-1}(s', a')$
[Figure: 3×3 grid world, cells 0,0 through 2,2, annotated with the current Q estimates]
Example of Q learning
(round 2)
• Round 2: Random initial state = $\langle 2,2 \rangle$
• Random action from $A_{\langle 2,2 \rangle}$ = north
– $s' = \langle 2,1 \rangle$
– $R_a(s, s') = 0$
• Random action from $A_{\langle 2,1 \rangle}$ = north
• $s' = \langle 2,0 \rangle$
• $R_a(s, s') = 100$
$Q(s, a) = R_a(s, s') + \gamma \max_{a'} Q_{n-1}(s', a')$, with $\gamma = 0.5$
[Figure: 3×3 grid world, cells 0,0 through 2,2, annotated with the current Q estimates; one estimate is now 50]
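As a worked check of the first move in round 2, the arithmetic can be written out. This sketch assumes a discount factor of 0.5 and that round 1 left the estimate for moving north from cell 2,1 at 100, which is what a reward of 100 for stepping onto the chest at cell 2,0 would give when every other estimate is still 0.

```latex
\begin{aligned}
Q(\langle 2,2 \rangle, \mathit{north})
  &= R_a(s, s') + \gamma \max_{a'} Q_{n-1}(\langle 2,1 \rangle, a') \\
  &= 0 + 0.5 \times 100 \\
  &= 50
\end{aligned}
```

Under these assumptions the move earns no immediate reward, yet its estimate becomes 50 because it leads to a state from which 100 is reachable in one step.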
Example of Q learning
(𝑄 after many, many runs…)
• $Q$ converged
• Policy is: $\pi(s) = \operatorname{argmax}_{a \in A_s} Q(s, a)$
[Figure: 3×3 grid world, cells 0,0 through 2,2, annotated with the converged Q values, which range from 6.25 to 100]
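The whole example can be reproduced with a short simulation. The sketch below assumes a 3×3 grid with the chest at cell 2,0, a reward of 100 for stepping onto it, a discount factor of 0.5, a learning rate of 1, and "north" meaning a decrease in the second coordinate. The function names and the episode count are illustrative choices.

```python
import random

MOVES = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}
CHEST = (2, 0)


def actions(s):
    """A_s: the moves that keep the agent inside the 3x3 grid."""
    return [a for a, (dx, dy) in MOVES.items()
            if 0 <= s[0] + dx <= 2 and 0 <= s[1] + dy <= 2]


def step(s, a):
    """Deterministic move; pays 100 on reaching the chest, 0 otherwise."""
    dx, dy = MOVES[a]
    s2 = (s[0] + dx, s[1] + dy)
    return s2, (100.0 if s2 == CHEST else 0.0)


def train(episodes=2000, gamma=0.5, seed=0):
    """Q-learning with random actions and a learning rate of 1."""
    rng = random.Random(seed)
    states = [(x, y) for x in range(3) for y in range(3) if (x, y) != CHEST]
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(episodes):
        s = rng.choice(states)          # random initial state
        while s != CHEST:
            a = rng.choice(actions(s))  # random action from A_s
            s2, r = step(s, a)
            best_next = 0.0 if s2 == CHEST else max(Q[(s2, a2)] for a2 in actions(s2))
            Q[(s, a)] = r + gamma * best_next
            s = s2
    return Q


def greedy_policy(Q):
    """pi(s) = argmax over a of Q(s, a); ties keep the first action seen."""
    policy = {}
    for (s, a), q in Q.items():
        if s not in policy or q > Q[(s, policy[s])]:
            policy[s] = a
    return policy
```

After training, each estimate equals 100 halved once for every step that separates the resulting cell from the chest, which gives the values 100, 50, 25, 12.5, and 6.25.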
Self-driving Cabs Example
• Four different locations (R, G, Y, B).