Reinforcement Learning

The document provides an introduction to Reinforcement Learning (RL), outlining its concepts, including learning types, Q-learning, and applications in self-driving cabs. It explains the process of RL, emphasizing the importance of observation, action, rewards, and refining strategies. Additionally, it contrasts RL with supervised and unsupervised learning, detailing key elements of RL systems such as policy, reward signal, and value function.

Uploaded by

Mukesh Verma

Introduction to Reinforcement Learning

Slides are compiled from various online sources with a great acknowledgment to all those who made the content available online.
Concepts
• What is Learning
• Machine Learning
• Types of Machine Learning
• Concept of Reinforcement Learning
• Q-Learning
• OpenAI Gym Environment
• Application of Reinforcement Learning in self-driving cabs
• Implementation of self-driving cabs using OpenAI Gym
Reinforcement Learning
• Learning to interact with an environment
– Robots, games, process control
– With limited human training
– Where the ‘right thing’ isn’t obvious
Reinforcement Learning

1. Observation of the environment
2. Deciding how to act using some strategy
3. Acting accordingly
4. Receiving a reward or penalty
5. Learning from the experiences and refining our strategy
6. Iterate until an optimal strategy is found
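The six steps above can be sketched as a single loop. This is only an illustration: the `Corridor` environment, its rewards (10 at the goal, -1 otherwise) and the `run_episode` helper are hypothetical names invented here, not part of the slides or of any library.

```python
import random

class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the corridor
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 10 if done else -1
        return self.pos, reward, done

def run_episode(env, choose_action, learn, max_steps=100):
    state = env.reset()                               # 1. observe the environment
    total = 0
    for _ in range(max_steps):
        action = choose_action(state)                 # 2. decide how to act
        next_state, reward, done = env.step(action)   # 3. act, 4. receive reward
        learn(state, action, reward, next_state)      # 5. learn from the experience
        total += reward
        state = next_state
        if done:
            break
    return total                                      # 6. the caller repeats episodes

rng = random.Random(0)
experiences = []
total = run_episode(Corridor(),
                    lambda s: rng.choice([-1, 1]),          # random strategy
                    lambda *exp: experiences.append(exp))   # "learning" = just recording
```

Here the strategy is random and learning only records experience; a real agent would replace both callbacks, for example with the Q-learning update discussed later in the deck.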
Supervised vs. Reinforcement
• Supervised Learning:
– Goal: $f(x) = y$
– Data: $[\langle x_1, y_1 \rangle, \dots, \langle x_n, y_n \rangle]$

• Reinforcement Learning:
– Goal: Maximize $\sum_{i=1}^{\infty} \mathrm{Reward}(\mathrm{State}_i, \mathrm{Action}_i)$
– Data: $\mathrm{Reward}_i, \mathrm{State}_{i+1} = \mathrm{Interact}(\mathrm{State}_i, \mathrm{Action}_i)$
• In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.
• An agent must therefore be able to learn from its own experience to maximize the reward.
Unsupervised vs. Reinforcement
• Reinforcement learning tries to maximize a reward signal instead of trying to find hidden structure.
• Uncovering structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning problem of maximizing a reward signal.
• We therefore consider reinforcement learning to be a third machine learning paradigm, alongside supervised and unsupervised learning.
How Reinforcement Learning is Different
• Delayed Reward

• Agent chooses training data

• Explore vs. Exploit (lifelong learning)

• Different terminology (can be confusing)


Example
• Action Space: East, West, North, South
• Exploration vs. Exploitation
– The parameter $\epsilon$ governs this decision
• Cumulative reward

Grid of states (column, row):
0, 0   1, 0   2, 0
0, 1   1, 1   2, 1
0, 2   1, 2   2, 2
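One common way to let a parameter $\epsilon$ settle the explore/exploit decision is the epsilon-greedy rule: with probability $\epsilon$ pick a random action, otherwise pick the action with the best current estimate. The slides do not spell out the rule, so the sketch below is an assumption; `epsilon_greedy` and the example values in `q` are made up for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: dict mapping action -> estimated value in the current state."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)             # explore: try any action
    return max(actions, key=q_values.get)      # exploit: best known action

# Illustrative estimates for one state (not taken from the slides)
q = {"east": 50.0, "west": 0.0, "north": 12.5, "south": 0.0}
greedy_choice = epsilon_greedy(q, epsilon=0.0)   # never explores
```

With `epsilon=0.0` the agent always exploits; with `epsilon=1.0` it always explores. Values in between trade the two off.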
Elements of Reinforcement Learning Systems
• Reinforcement learning systems have 4 main elements:
– Policy
– Reward signal
– Value function
– Optional model of the environment
Policy
• A policy is a mapping from the perceived states of the environment to
actions to be taken in those states
• A reinforcement learning agent uses a policy to select actions given the
current environment state
Reward Signal
• The reward signal defines the goal
• On each time step, the environment sends a single number called the
reward to the reinforcement learning agent.
• The agent’s objective is to maximize the total reward that it receives
over the long run
• The reward signal is used to alter the policy
Value Function (1)
• The reward signal indicates what is good in the short run while the value
function indicates what is good in the long run
• The value of a state is the total amount of reward an agent can expect
to accumulate over the future, starting in that state
• Compute the value using the states that are likely to follow the current
state and the rewards available in those states
• Future rewards may be time-discounted with a factor in the interval [0, 1]
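As a small illustration of time-discounting, the sketch below sums a reward sequence with each reward weighted by a power of the discount factor. The helper name `discounted_return` is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for a finite reward list."""
    total = 0.0
    for r in reversed(rewards):
        # Work backwards: value now = reward now + gamma * value of the rest
        total = r + gamma * total
    return total

# Three steps without reward, then a reward of 100, discounted by 0.5 per step
example = discounted_return([0, 0, 0, 100], gamma=0.5)
```

A factor near 0 makes the agent short-sighted, while a factor of 1 weights all future rewards equally.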
Value Function (2)
• Use the values to make and evaluate decisions
• Action choices are made based on value judgements
• Prefer actions that bring about states of highest value instead of highest
reward
• Rewards are given directly by the environment
• Values must continually be re-estimated from the sequence of
observations that an agent makes over its lifetime
Model-free versus Model-based
• A model of the environment allows inferences to be made about how
the environment will behave
• Example: Given a state and an action to be taken while in that state, the
model could predict the next state and the next reward
• Models are used for planning, which means deciding on a course of
action by considering possible future situations before they are
experienced
• Model-based methods use models and planning. Think of this as
modelling the dynamics p(s’ | s, a)
• Model-free methods learn exclusively from trial-and-error (i.e. no
modelling of the environment)
• This presentation focuses on model-free methods
Setup for Reinforcement Learning

Markov Decision Process (environment)
• Discrete-time stochastic control process
• Each time step, in state $s$:
– Agent chooses action $a$ from set $A_s$
– Moves to new state with probability: $P_a(s, s')$
– Receives reward: $R_a(s, s')$
• Every outcome depends on $s$ and $a$
– Nothing depends on previous states/actions

Policy (agent’s behavior)
• $\pi(s)$ – The action to take in state $s$
• Goal maximize: $\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})$
– $a_t = \pi(s_t)$
– $0 \le \gamma < 1$ – Tradeoff immediate vs future
• $V^{\pi}(s) = \sum_{s'} P_{\pi(s)}(s, s') \cdot \left( R_{\pi(s)}(s, s') + \gamma V^{\pi}(s') \right)$
– $P_{\pi(s)}(s, s')$: Probability of moving to each state
– $R_{\pi(s)}(s, s')$: Reward for making that move
– $V^{\pi}(s')$: Value of being in that state
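The recursive definition of the value of a state under a policy can be turned into a simple iteration: start from zero and apply the right-hand side repeatedly. The sketch below assumes the transition probabilities and rewards are known and stored as nested dictionaries. The two-state MDP used to exercise it is invented for illustration.

```python
def evaluate_policy(states, policy, P, R, gamma, sweeps=100):
    """Iterate V(s) = sum over s' of P[a][s][s'] * (R[a][s][s'] + gamma * V(s')),
    with a = policy[s]. P and R are indexed as [action][state][next_state]."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        # Build the new table from the old one (synchronous update)
        V = {s: sum(p * (R[policy[s]][s][s2] + gamma * V[s2])
                    for s2, p in P[policy[s]][s].items())
             for s in states}
    return V

# Hypothetical MDP: from "A", action "go" reaches "end" half the time (reward 10),
# otherwise stays in "A" (reward 0). "end" is absorbing with reward 0.
states = ["A", "end"]
policy = {"A": "go", "end": "stay"}
P = {"go": {"A": {"end": 0.5, "A": 0.5}}, "stay": {"end": {"end": 1.0}}}
R = {"go": {"A": {"end": 10, "A": 0}}, "stay": {"end": {"end": 0}}}
V = evaluate_policy(states, policy, P, R, gamma=0.5)
```

For this toy case the fixed point can be checked by hand: V(A) = 0.5 · 10 + 0.5 · 0.5 · V(A), so V(A) = 5 / 0.75.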


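Simple Example of Agent in an Environment
State:
Map locations
$\{\langle 0,0 \rangle, \langle 1,0 \rangle, \dots, \langle 2,2 \rangle\}$

Actions:
Move within map
Reaching chest ends episode
$A_{\langle 0,0 \rangle} = \{east, south\}$
$A_{\langle 1,0 \rangle} = \{east, south, west\}$
$A_{\langle 2,0 \rangle} = \emptyset$
$A_{\langle 2,2 \rangle} = \{north, west\}$

Reward:
100 at chest
0 for others
$R_{east}(\langle 1,0 \rangle, \langle 2,0 \rangle) = 100$
$R_{north}(\langle 2,1 \rangle, \langle 2,0 \rangle) = 100$
$R_{*}(*, *) = 0$

[Figure: 3×3 map of cells <0,0> to <2,2> with the chest (score 100) at <2,0>]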
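The map example can be written down as a few lines of code. This is a sketch of the environment as described on the slide (3×3 grid, chest at <2,0>, reward 100 for moving onto the chest, 0 otherwise); the names `MOVES`, `actions` and `step` are chosen here for illustration.

```python
SIZE = 3
CHEST = (2, 0)   # terminal state; states are (column, row), row 0 at the top
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, -1), "south": (0, 1)}

def actions(state):
    """Actions available in a state: moves that stay on the map; none at the chest."""
    if state == CHEST:
        return []
    x, y = state
    return [a for a, (dx, dy) in MOVES.items()
            if 0 <= x + dx < SIZE and 0 <= y + dy < SIZE]

def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    dx, dy = MOVES[action]
    next_state = (state[0] + dx, state[1] + dy)
    reward = 100 if next_state == CHEST else 0
    return next_state, reward
```

For example, `actions((0, 0))` contains only east and south, and `step((1, 0), "east")` lands on the chest with reward 100.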
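Policies
$R_{east}(\langle 1,0 \rangle, \langle 2,0 \rangle) = 100$
$R_{north}(\langle 2,1 \rangle, \langle 2,0 \rangle) = 100$
$R_{*}(*, *) = 0$
$\gamma = 0.5$

Policy
$\pi(s) = a$
$\pi(\langle 0,0 \rangle) = south$
$\pi(\langle 0,1 \rangle) = east$
$\pi(\langle 0,2 \rangle) = east$
$\pi(\langle 1,0 \rangle) = east$
$\pi(\langle 1,1 \rangle) = north$
$\pi(\langle 1,2 \rangle) = north$
$\pi(\langle 2,0 \rangle) = \emptyset$
$\pi(\langle 2,1 \rangle) = west$
$\pi(\langle 2,2 \rangle) = north$

Evaluating Policies
$V^{\pi}(s) = \sum_{i=0}^{\infty} \gamma^{i} r_{i+1}$
$V^{\pi}(\langle 1,0 \rangle) = \gamma^{0} \cdot 100 = 100$
$V^{\pi}(\langle 1,1 \rangle) = \gamma^{0} \cdot 0 + \gamma^{1} \cdot 100 = 50$
$V^{\pi}(\langle 0,0 \rangle) = \gamma^{0} \cdot 0 + \gamma^{1} \cdot 0 + \gamma^{2} \cdot 0 + \gamma^{3} \cdot 100 = 12.5$
(Move to <0,1>, move to <1,1>, move to <1,0>, move to <2,0>)
Policy could be better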
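The values on this slide can be reproduced by following the policy from a start state and summing discounted rewards. The sketch below hard-codes the slide's policy and $\gamma = 0.5$; the function name `value` and the dictionary layout are illustrative choices.

```python
GAMMA = 0.5
CHEST = (2, 0)
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, -1), "south": (0, 1)}

# The policy from the slide, keyed by (column, row)
POLICY = {
    (0, 0): "south", (0, 1): "east", (0, 2): "east",
    (1, 0): "east",  (1, 1): "north", (1, 2): "north",
    (2, 1): "west",  (2, 2): "north",
}

def value(state, policy, gamma=GAMMA, max_steps=50):
    """Follow the policy from `state` and sum the discounted rewards."""
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        if state == CHEST:
            break
        dx, dy = MOVES[policy[state]]
        state = (state[0] + dx, state[1] + dy)
        total += discount * (100 if state == CHEST else 0)
        discount *= gamma
    return total

# Changing a single entry shows why the slide's policy "could be better"
better = {**POLICY, (0, 0): "east"}
```

Under the slide's policy `value((0, 0), POLICY)` is 12.5, while heading east first, as in `better`, reaches the chest in two moves and raises the value to 50.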
Q learning
Learn a policy $\pi(s)$ that optimizes $V^{\pi}(s)$ for all states, using:
– No prior knowledge of state transition probabilities: $P_a(s, s')$
– No prior knowledge of the reward function: $R_a(s, s')$

Approach:
– Initialize estimate of discounted reward for every state/action pair: $\hat{Q}(s, a) = 0$
– Repeat (for a while):
• Take a random action $a$ from $A_s$
• Receive $s'$ and $R_a(s, s')$ from environment
• Update $\hat{Q}_t(s, a) = (1 - \alpha)\,\hat{Q}_{t-1}(s, a) + \alpha \left[ R_a(s, s') + \gamma \max_{a'} \hat{Q}_{t-1}(s', a') \right]$
• Random restart if in terminal state

$\alpha$ is the learning rate and $\gamma$ is the discount factor
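A minimal sketch of the update step, assuming the Q estimates live in a dictionary keyed by (state, action) pairs. The function name `q_update` and its argument list are chosen here for illustration; a terminal state is represented by an empty list of next actions.

```python
from collections import defaultdict

def q_update(Q, s, a, reward, s_next, next_actions, alpha, gamma):
    """Blend the old estimate with reward + discounted best next estimate."""
    # Best estimate available from the next state (0 if it is terminal)
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)
    return Q[(s, a)]

Q = defaultdict(float)   # every estimate starts at 0
first = q_update(Q, "s", "a", reward=10, s_next="t", next_actions=[],
                 alpha=0.5, gamma=0.9)
```

With `alpha=1.0` the old estimate is discarded completely, which gives the simplified rule Q(s, a) = R + γ max Q(s', a') that suits a deterministic environment.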


Example of Q learning
(round 1)
• Initialize $\hat{Q}$ to 0
• Random initial state = <1,1>
• Random action from $A_{\langle 1,1 \rangle}$ = east
– $s' = \langle 2,1 \rangle$
– $R_a(s, s') = 0$
• Update $\hat{Q}(\langle 1,1 \rangle, east) = 0$
• Random action from $A_{\langle 2,1 \rangle}$ = north
– $s' = \langle 2,0 \rangle$
– $R_a(s, s') = 100$
• Update $\hat{Q}(\langle 2,1 \rangle, north) = 100$
• No more moves possible, start again…

$\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')$

[Figure: grid of $\hat{Q}$ values after round 1; all entries are 0 except $\hat{Q}(\langle 2,1 \rangle, north) = 100$]
Example of Q learning
(round 2)
• Round 2: Random initial state = <2,2>
• Random action from $A_{\langle 2,2 \rangle}$ = north
– $s' = \langle 2,1 \rangle$
– $R_a(s, s') = 0$
• Update $\hat{Q}(\langle 2,2 \rangle, north) = 0 + \gamma \cdot 100 = 50$
• Random action from $A_{\langle 2,1 \rangle}$ = north
– $s' = \langle 2,0 \rangle$
– $R_a(s, s') = 100$
• Update $\hat{Q}(\langle 2,1 \rangle, north)$ = still 100
• No more moves possible, start again…

$\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')$, with $\gamma = 0.5$

[Figure: grid of $\hat{Q}$ values after round 2; nonzero entries are $\hat{Q}(\langle 2,1 \rangle, north) = 100$ and $\hat{Q}(\langle 2,2 \rangle, north) = 50$]
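The two rounds can be replayed in code to check the arithmetic. The sketch applies the simplified update with $\gamma = 0.5$ to the same transitions as on the slides; the `update` helper and the dictionary `Q` are illustrative names.

```python
GAMMA = 0.5
Q = {}   # (state, action) -> estimate; missing entries count as 0

def update(s, a, reward, s_next, next_actions):
    """Simplified rule: Q(s, a) = reward + gamma * max over a' of Q(s', a')."""
    best = max((Q.get((s_next, a2), 0.0) for a2 in next_actions), default=0.0)
    Q[(s, a)] = reward + GAMMA * best
    return Q[(s, a)]

# Round 1: start at <1,1>, go east to <2,1>, then north to the chest at <2,0>
round1 = [update((1, 1), "east", 0, (2, 1), ["north", "south", "west"]),
          update((2, 1), "north", 100, (2, 0), [])]

# Round 2: start at <2,2>, go north to <2,1>, then north to the chest
round2 = [update((2, 2), "north", 0, (2, 1), ["north", "south", "west"]),
          update((2, 1), "north", 100, (2, 0), [])]
```

In round 1 the first update stays at 0 because nothing is known yet about <2,1>. In round 2 the value 100 learned for <2,1> is discounted once and flows back to <2,2>.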
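Example of Q learning
(some acceleration…)
$\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')$, with $\gamma = 0.5$
• Random Initial State <0,0>
• Update $\hat{Q}(\langle 1,1 \rangle, east) = 50$
• Update $\hat{Q}(\langle 1,2 \rangle, east) = 25$

[Figure: grid of $\hat{Q}$ values; nonzero entries are $\hat{Q}(\langle 2,1 \rangle, north) = 100$, $\hat{Q}(\langle 2,2 \rangle, north) = 50$, $\hat{Q}(\langle 1,1 \rangle, east) = 50$ and $\hat{Q}(\langle 1,2 \rangle, east) = 25$]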
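Example of Q learning
(some acceleration…)
$\hat{Q}(s, a) = R_a(s, s') + \gamma \max_{a'} \hat{Q}_{n-1}(s', a')$, with $\gamma = 0.5$
• Random Initial State <0,2>
• Update $\hat{Q}(\langle 0,1 \rangle, east) = 25$
• Update $\hat{Q}(\langle 1,0 \rangle, east) = 100$

[Figure: grid of $\hat{Q}$ values; nonzero entries are $\hat{Q}(\langle 1,0 \rangle, east) = 100$, $\hat{Q}(\langle 2,1 \rangle, north) = 100$, $\hat{Q}(\langle 0,1 \rangle, east) = 25$, $\hat{Q}(\langle 1,1 \rangle, east) = 50$, $\hat{Q}(\langle 2,2 \rangle, north) = 50$ and $\hat{Q}(\langle 1,2 \rangle, east) = 25$]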
Example of Q learning
($\hat{Q}$ after many, many runs…)
• $\hat{Q}$ converged
• Policy is: $\pi(s) = \arg\max_{a \in A_s} \hat{Q}(s, a)$

[Figure: grid of converged $\hat{Q}$ values for every state/action pair, ranging from 6.25 to 100]
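Putting the pieces together, the sketch below runs the random-action procedure on the 3×3 map until the estimates settle, then reads off the greedy policy. It assumes the simplified update (learning rate 1) and $\gamma = 0.5$ from the worked rounds. The episode count, the seed and the function names are arbitrary choices made here.

```python
import random

GAMMA, SIZE, CHEST = 0.5, 3, (2, 0)
MOVES = {"east": (1, 0), "west": (-1, 0), "north": (0, -1), "south": (0, 1)}

def actions(s):
    if s == CHEST:
        return []
    return [a for a, (dx, dy) in MOVES.items()
            if 0 <= s[0] + dx < SIZE and 0 <= s[1] + dy < SIZE]

def train(episodes=5000, seed=0):
    rng = random.Random(seed)
    states = [(x, y) for x in range(SIZE) for y in range(SIZE)]
    Q = {(s, a): 0.0 for s in states for a in actions(s)}
    for _ in range(episodes):
        s = rng.choice(states)                   # random (re)start
        while s != CHEST:
            a = rng.choice(actions(s))           # purely random action
            dx, dy = MOVES[a]
            s2 = (s[0] + dx, s[1] + dy)
            reward = 100 if s2 == CHEST else 0
            best = max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)
            Q[(s, a)] = reward + GAMMA * best    # simplified update (alpha = 1)
            s = s2
    return Q

def greedy_policy(Q):
    """pi(s) = argmax over a of Q(s, a) for every non-terminal state."""
    return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in {s for s, _ in Q}}

Q = train()
policy = greedy_policy(Q)
```

Each converged entry equals 100 halved once per move still needed after the action, which is where figures such as 50, 25, 12.5 and 6.25 come from. Several states have two equally good actions, so the greedy policy is not unique there.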
Self-driving Cabs Example
• Four different locations (R, G, Y, B).
• Passenger is in location Y and they wish to go to location R.
• The State Space is the set of all possible situations our taxi could inhabit.
• Our taxi environment has 5×5×5×4=500 total possible states.
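One way to read the product 5×5×5×4 is: 5 rows and 5 columns for the taxi, 5 passenger locations (the four marked locations plus "inside the taxi"), and 4 destinations. Under that reading, each situation can be packed into a single index from 0 to 499. The `encode` and `decode` helpers below are a hypothetical sketch of such a packing; an actual environment may number its states differently.

```python
def encode(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row 0-4, col 0-4, passenger 0-4, destination 0-3) into 0..499."""
    return ((taxi_row * 5 + taxi_col) * 5 + passenger_loc) * 4 + destination

def decode(index):
    """Inverse of encode: recover the four components from the index."""
    index, destination = divmod(index, 4)
    index, passenger_loc = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_loc, destination

all_indices = {encode(r, c, p, d)
               for r in range(5) for c in range(5)
               for p in range(5) for d in range(4)}
```

A single integer per state is convenient for tabular methods, since a Q-table can then be a plain 500 × 6 array.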
Action Space
• We have six possible actions:
– south
– north
– east
– west
– pickup
– dropoff
Reward
• -1 for each step that does not reach the destination
• -10 for a pickup/dropoff at the wrong location
• +20 for the right pickup/dropoff action
OpenAI Gym
• This Python library provides a large collection of test environments for reinforcement learning agents, with a shared interface for writing general algorithms and testing them.
• How to install?
– Run pip install gym in the terminal.
– You’ll get some classic environments to start working on your agent.
• You can check out other available environments like Algorithmic, Atari, Box2D and Robotics here: [Link]
References:
• Richard S. Sutton, and Andrew G. Barto, Reinforcement
Learning: An Introduction, second edition.
• [Link]
with-openai-d445c2c687d2
• Geoff Hulten, Reinforcement Learning
Happy Learning!
Thank you !!
