Unit-3 RL

This document discusses various concepts in reinforcement learning, focusing on Q-learning, Temporal-Difference (TD) learning, and R-learning. It explains the components of Q-learning, including agents, states, actions, rewards, and Q-values, as well as the methods for calculating Q-values using Temporal difference and Bellman's equation. Additionally, it highlights the differences between on-policy and off-policy learning, and the significance of TD learning in predicting expected values in sequences of states.

Uploaded by

HemaRamachandran


UNIT-3

LEARNING

Temporal-Difference prediction - Optimality of TD - Q Learning - Off-Policy TD Control -
R Learning - Actor Critic Model - Unifying Monte Carlo and TD - Traces - Games

Q-learning:
Q-learning is a type of reinforcement learning in which an agent improves iteratively by
learning, from experience, which action is best in each state.
Q-learning takes an off-policy approach to reinforcement learning. It aims to determine
the optimal action for the current state, and it learns the value of that optimal behavior
regardless of the policy the agent actually follows while exploring. Because the learned
values do not depend on the policy being followed, a defined policy is not needed.

The multiple components of Q-learning include the following:

• Agents. The agent is the entity that acts and operates within an environment.
• States. The state is a variable that identifies the agent's current position in the
environment.
• Actions. An action is the operation the agent performs when it is in a specific state.
• Rewards. A reward is the positive or negative feedback the agent receives for an
action; it is a foundational concept within reinforcement learning.
• Episodes. An episode ends when the agent reaches a terminal state and can take no
further action.
• Q-values. The Q-value measures how good an action is in a particular state, as an
estimate of the expected future reward.

Here are the two methods to determine the Q-value:

• Temporal difference. The temporal difference formula updates the Q-value of the
current state and action using the difference between that estimate and a target built
from the observed reward and the value of the next state and action.
• Bellman's equation. Bellman's equation expresses the value of a given state
recursively, as the immediate reward plus the discounted value of what follows.
Higher-valued states are more desirable, and in each state the action with the highest
value is considered optimal.
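
For reference, Bellman's equation for action values can be written in the standard textbook notation. The notation is an assumption here, since the notes give no formula at this point: γ is the discount factor, r the observed reward, s' the next state and a' an action available in s'.

```latex
% Bellman optimality equation for action values
Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]

% The optimal state value is the value of the best action in that state
V^*(s) = \max_{a} Q^*(s, a)
```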

What is a Q-table?

The Q-table is a lookup table that stores, for each state in a specific environment, the
estimated future reward of each available action. A Q-table helps an agent understand
what actions are likely to lead to positive outcomes in different situations.

The table rows represent different situations the agent might encounter, and the columns
represent the actions it can take. As the agent interacts with the environment and
receives feedback in the form of rewards or penalties, the values in the Q-table are
updated to reflect what the model has learned.

The Q-table is directly related to the concept of the Q-function. The Q-function is a
mathematical function that takes the current state of the environment and the action
under consideration as inputs. It outputs the expected future reward for that action in
the specific state. The Q-table allows the agent to look up the expected future reward for
any given state-action pair to move toward an optimized state.
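
As a minimal sketch of the table described above, assume a toy problem with 3 states and 2 actions. The sizes, function names and values below are illustrative only and do not come from the notes.

```python
# Hypothetical toy problem: 3 states (rows) and 2 actions (columns).
N_STATES, N_ACTIONS = 3, 2

# The Q-table starts at zero: nothing has been learned yet.
q_table = [[0.0 for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def q_value(table, state, action):
    """Q-function lookup: estimated future reward for a state-action pair."""
    return table[state][action]


def best_action(table, state):
    """Return the action (column index) with the largest Q-value in this row."""
    row = table[state]
    return max(range(len(row)), key=lambda a: row[a])


# Illustrative values, as if they had been learned from experience.
q_table[1] = [0.2, 0.7]
chosen = best_action(q_table, 1)  # action 1 has the larger value in state 1
```

When several actions tie, `best_action` returns the first of them; a real agent would usually break ties at random.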

The procedural form of the algorithm is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

The parameters used in the Q-value update process are:

• α - the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values
are never updated, hence nothing is learned. Setting a high value such as 0.9 means
that learning can occur quickly.
• γ - the discount factor, also set between 0 and 1. This models the fact that future
rewards are worth less than immediate rewards. Mathematically, the discount
factor needs to be set less than 1 for the algorithm to converge.
• max_a' Q(s', a') - the maximum value that is attainable in the state following the
current one, i.e. the estimated reward for taking the optimal action thereafter.
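
To see how the two parameters interact, here is one update worked through by hand. It assumes the standard Q-learning update rule and uses made-up numbers (α = 0.5, γ = 0.9, a current estimate Q(s, a) = 2, an observed reward r = 1, and a best next-state value of 4); none of these values come from the notes.

```latex
\begin{aligned}
\text{target} &= r + \gamma \max_{a'} Q(s', a') = 1 + 0.9 \times 4 = 4.6 \\
\text{error}  &= \text{target} - Q(s, a) = 4.6 - 2 = 2.6 \\
Q(s, a) &\leftarrow Q(s, a) + \alpha \times \text{error} = 2 + 0.5 \times 2.6 = 3.3
\end{aligned}
```

With α = 0 the estimate would stay at 2, and with α = 1 it would jump straight to the target 4.6.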

This procedural approach can be translated into plain English steps as follows:

1. Initialize the Q-values table, Q(s, a).
2. Observe the current state, s.
3. Choose an action, a, for that state based on one of the action selection policies
(ε-soft, ε-greedy or softmax).
4. Take the action, and observe the reward, r, as well as the new state, s'.
5. Update the Q-value for the state and action using the observed reward and the
maximum value possible for the next state. The updating is done according to the
formula and parameters described above.
6. Set the state to the new state, and repeat the process until a terminal state is
reached.
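
The six steps can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the notes' own implementation: the environment is a hypothetical five-state corridor in which the agent starts at state 0 and receives a reward of 1 on reaching state 4, and the names `step`, `epsilon_greedy` and `q_learning` are invented for illustration. Action selection uses ε-greedy.

```python
import random

N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy environment: returns (reward, next_state, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return (1.0 if done else 0.0), next_state, done


def epsilon_greedy(q, state, epsilon, rng):
    """Step 3: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties at random so the untrained agent does not get stuck.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]                # step 1
    for _ in range(episodes):
        state, done = 0, False                               # step 2
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)  # step 3
            reward, next_state, done = step(state, action)   # step 4
            # Step 5: a terminal state has no future value to bootstrap from.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state                               # step 6
    return q


q = q_learning()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy in this toy corridor moves right in every non-terminal state, and the learned values shrink by roughly a factor of γ for each extra step away from the goal.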

Off-Policy TD Control
The on-policy and off-policy algorithms discussed here are model-free reinforcement
learning algorithms, meaning that we do not have access to the transition probability
distribution. This is opposed to model-based methods such as Monte-Carlo Tree Search
(MCTS).
On-policy algorithms attempt to improve upon the current behavior policy that is
used to make decisions, and therefore these algorithms learn the value of the policy
carried out by the agent.
Off-policy algorithms learn the value of the optimal policy, Q*, and can improve upon
a policy that is different from the behavior policy.
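
The difference shows up in the target each kind of algorithm bootstraps from. The sketch below contrasts the two, using SARSA as the on-policy example; SARSA is not named in these notes and is brought in here only as the standard on-policy counterpart of Q-learning. The function names and Q-values are hypothetical.

```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action the behavior policy actually takes next."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma):
    """Off-policy: bootstrap from the greedy action, whatever is actually taken next."""
    return reward + gamma * max(q[next_state])


# Hypothetical Q-values for the next state: action 0 is worth 1.0, action 1 is worth 3.0.
q = {"s1": [1.0, 3.0]}

# Suppose the behavior policy explores and picks action 0 in the next state.
on_policy = sarsa_target(q, reward=0.0, next_state="s1", next_action=0, gamma=0.9)
off_policy = q_learning_target(q, reward=0.0, next_state="s1", gamma=0.9)
```

Here the on-policy target reflects the exploratory choice (about 0.9), while the off-policy target assumes the best action will be taken (about 2.7).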

Temporal Difference Learning (TD Learning)

One of the problems with the environment is that rewards usually are not immediately
observable. For example, in tic-tac-toe and similar games, we only know the reward on the
final move (terminal state). All other moves will have 0 immediate reward.

TD learning is a technique for predicting a variable's expected value in a sequence of
states without labelled target values. TD uses a mathematical trick to replace complex
reasoning about the future with a simple learning procedure that can produce the same
results. Instead of calculating the total future reward, TD tries to predict the combination
of immediate reward and its own reward prediction at the next moment in time:

$$V^*_t = r_{t+1} + \gamma V_{t+1}$$

Here the target at time t stands in for the combination of discounted rewards in the
future, which implies that future rewards are valued less. The TD error is the difference
between this target, taken as the ultimate correct value (V*_t), and our current
prediction (V_t).

Parameters

Alpha (α): the learning rate. This parameter shows how much we should adjust our
estimates based on the error. The learning rate is between 0 and 1. A large learning rate
adjusts aggressively and might lead to fluctuating training results that do not converge.
A small learning rate adjusts slowly, which will take more time to converge.

Gamma (γ): the discount rate. This parameter shows how much we value future rewards.
The discount rate is between 0 and 1. The bigger the discount rate, the more we value
future rewards.
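
The effect of both parameters can be checked with a few lines of arithmetic. This is an illustrative sketch with made-up numbers: the helper names are hypothetical, and the target is held fixed for simplicity, so it shows speed of adjustment but not the fluctuation that noisy targets cause with a large α.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma**k for its delay k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def repeated_td_updates(v, target, alpha, steps):
    """Apply v <- v + alpha * (target - v) a number of times."""
    for _ in range(steps):
        v += alpha * (target - v)
    return v


# Hypothetical game: the only reward arrives on the third move.
rewards = [0.0, 0.0, 1.0]
low_gamma = discounted_return(rewards, gamma=0.5)   # future reward counts for little
high_gamma = discounted_return(rewards, gamma=0.9)  # future reward counts for more

# Ten updates toward a fixed target of 1.0, starting from 0.0.
slow = repeated_td_updates(0.0, target=1.0, alpha=0.1, steps=10)
fast = repeated_td_updates(0.0, target=1.0, alpha=0.9, steps=10)
```

With γ = 0.5 the delayed reward is worth 0.25 now, against about 0.81 with γ = 0.9. After ten updates the small learning rate has closed only about 65% of the gap to the target, while the large one has closed almost all of it.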

TD Prediction

As we saw for Monte Carlo methods, prediction refers to the problem of estimating the
values of states. The value of a state indicates how good that state is for an agent in the
given environment: the higher the value of the state, the better it is to be in that state.

Monte Carlo and Temporal Difference learning are similar in the sense that they both use
real-world experience to evaluate a given policy. However, Monte Carlo methods wait until
the return following the visit is known, which is only after the episode ends, before
updating the value of the state. TD methods update the state value at the next time step:
at time t+1 they immediately form a target and make a useful update using the observed
reward.

Updating the state value after just one time step is called one-step TD or TD(0). It is a
special case of the TD(lambda) and n-step TD methods, which are beyond the scope of our
discussion; however, the principles we explore here can be extended to those methods
without much complexity.
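
The timing difference between the two methods can be seen on a single short episode. The sketch below is illustrative only: the two-state episode, the function names, and the use of a constant step size α for the Monte Carlo update are assumptions made for the example.

```python
def td0_episode(values, episode, alpha, gamma):
    """TD(0): update each state right after its transition is observed."""
    for state, reward, next_state in episode:
        v_next = 0.0 if next_state is None else values[next_state]
        values[state] += alpha * (reward + gamma * v_next - values[state])
    return values


def monte_carlo_episode(values, episode, alpha, gamma):
    """Every-visit Monte Carlo: wait for the episode to end, then update toward returns."""
    g = 0.0
    for state, reward, _ in reversed(episode):
        g = reward + gamma * g  # return following this visit
        values[state] += alpha * (g - values[state])
    return values


# Hypothetical episode A -> B -> terminal, with reward 1 only on the last step.
# Each entry is (state, reward, next_state); None marks the terminal state.
episode = [("A", 0.0, "B"), ("B", 1.0, None)]

td_values = td0_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0)
mc_values = monte_carlo_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0)
```

After this one episode Monte Carlo has moved both A and B to 0.5, because it uses the full return. TD(0) has moved only B to 0.5 and leaves A at 0.0, since A was updated before B's estimate had changed; the reward reaches A on a later pass through the same states.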

R Learning:

• R-learning is a method for optimizing average reward, weighing both far-future and
near-term reward the same.
• While optimizing the average reward may seem more natural, it also has some
problems, such as making the agent “lazy”.
• As long as the agent eventually reaches the goal, it will get the maximum average
reward, since only “the tail” matters for the average.
• There have been some attempts to work around this, but Q-learning remains the far
more popular algorithm.
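
A single R-learning update can be sketched as follows. The notes give no formula, so this follows the form in which the algorithm is commonly presented in textbooks: values are measured relative to an average-reward estimate ρ, there is no discount factor, and ρ is adjusted only after greedy actions. The second step size β and all names and numbers below are assumptions for illustration.

```python
def r_learning_step(q, rho, state, action, reward, next_state, alpha, beta):
    """One R-learning update; modifies q in place and returns the new rho."""
    # Relative value update: the reward is compared with the average reward rho,
    # and the next state's value is not discounted.
    q[state][action] += alpha * (
        reward - rho + max(q[next_state]) - q[state][action]
    )
    # The average-reward estimate is adjusted only if the action taken is greedy.
    if q[state][action] == max(q[state]):
        rho += beta * (reward - rho + max(q[next_state]) - max(q[state]))
    return rho


# Hypothetical two-state problem with two actions per state.
q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
rho = r_learning_step(q, rho=0.0, state=0, action=1, reward=1.0,
                      next_state=1, alpha=0.5, beta=0.1)
```

In this example the relative value of the action rises to 0.5 and the average-reward estimate to 0.05.
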

Temporal Difference methods are said to combine the sampling of Monte Carlo with the
bootstrapping of DP (dynamic programming). In Monte Carlo methods the target is an
estimate because we do not know the actual expected value and instead use a sample
return from that particular episode. In DP the target is an estimate because the value of
the next state is not known and the current estimate is used instead. In TD the target is
an estimate for both reasons: it samples the expected value, and it uses the current
estimate instead of the true state value.

In the TD update equation,

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

the quantity in brackets is a measure of an error: the difference between the estimated
value of S_t and the better estimate R_{t+1} + γV(S_{t+1}) available at the next time step.
This quantity is called the TD error, and it appears in various forms throughout
reinforcement learning.
