Q-Learning Explained for Robotics

The document describes how Q-learning can be used to train a robot to navigate a maze and reach the goal point without stepping on mines. The robot's position in the maze is modeled as states in a Q-table, while possible movements are actions. Initially random, the robot learns over time to take advantageous actions by updating the Q-table values using rewards and penalties. Through iterative training, the robot exploits what it has learned to find the shortest path to the goal.

Uploaded by Surya Mudaliar


Q-Learning — a simplistic overview

Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile
at a time. If the robot steps onto a mine, the robot is dead.
The robot has to reach the end point in the shortest time
possible.

The scoring/reward system is as below:

1. The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
2. If the robot steps on a mine, the point loss is 100 and the game
ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
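
The scoring rules above can be written down as a small reward function. This is only a sketch: the tile labels ("mine", "power", "goal") and the choice to charge the 1-point step cost on top of each tile's own score are assumptions made for illustration, not something the rules spell out.

```python
# Reward values taken from the scoring rules above.
STEP_PENALTY = -1     # every move costs 1 point
MINE_PENALTY = -100   # stepping on a mine; the game ends
POWER_BONUS = 1       # picking up power
GOAL_REWARD = 100     # reaching the end point


def reward_for(tile):
    """Return (reward, done) for moving onto a tile.

    Assumption: the step penalty is added to the tile's own score,
    and the tile labels are hypothetical names for this sketch.
    """
    if tile == "mine":
        return STEP_PENALTY + MINE_PENALTY, True
    if tile == "goal":
        return STEP_PENALTY + GOAL_REWARD, True
    if tile == "power":
        return STEP_PENALTY + POWER_BONUS, False
    return STEP_PENALTY, False
```

Returning a `done` flag alongside the reward lets a training loop end the game on a mine or at the goal.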
Now, the obvious question is: How do we train a robot to
reach the end goal with the shortest path without stepping on a
mine?

So, how do we solve this?

Introducing the Q-Table

Q-Table is just a fancy name for a simple lookup table where
we store the maximum expected future reward for each action
at each state. Basically, this table will guide us to the best
action at each state.

There are four possible actions at each non-edge tile. When
the robot is at a state, it can move up, down, right, or left.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions and the rows are
the states.
Each Q-table score will be the maximum expected future
reward that the robot will get if it takes that action at that
state. This is an iterative process, as we need to improve the
Q-Table at each iteration.
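
To make the "rows are states, columns are actions" layout concrete, here is a minimal sketch of reading the best action out of such a table. The two rows of numbers are invented for illustration, and `best_action` is a hypothetical helper name.

```python
ACTIONS = ["up", "down", "left", "right"]  # one column per action

# One row per state; the numbers are made up for illustration.
q_table = [
    [0.0, 0.0, 0.0, 1.5],   # state 0: "right" currently looks best
    [0.2, -3.0, 0.0, 0.1],  # state 1: "up" currently looks best
]


def best_action(q_table, state):
    """Return the action name with the highest Q-value in this state's row."""
    row = q_table[state]
    best_index = max(range(len(row)), key=lambda i: row[i])
    return ACTIONS[best_index]
```

Calling `best_action(q_table, 0)` looks along row 0 and picks the column with the largest score.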

But the questions are:

• How do we calculate the values of the Q-table?
• Are the values available or predefined?

To learn each value of the Q-table, we use the Q-Learning
algorithm.

Mathematics: the Q-Learning algorithm

Q-function
The Q-function uses the Bellman equation and takes two
inputs: state (s) and action (a).
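
For reference, the Q-function is usually written as the expected discounted return from taking action a in state s and following a policy π afterwards. The notation below is the standard textbook form and is an assumption here: the symbols π, γ (a discount factor between 0 and 1) and R (the reward at each time step) do not appear in the text.

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid s_t, a_t \right]
```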

Using the above function, we get the values of Q for the cells
in the table.
When we start, all the values in the Q-table are zeros.

There is an iterative process of updating the values. As we
start to explore the environment, the Q-function gives us
better and better approximations by continuously updating the
Q-values in the table.

Now, let’s understand how the updating takes place.

Introducing the Q-learning algorithm process

Each of the colored boxes is one step. Let’s understand each
of these steps in detail.

Step 1: initialize the Q-Table

We will first build a Q-table. There are n columns, where n =
number of actions. There are m rows, where m = number of
states. We will initialise the values at 0.

In our robot example, we have four actions (a = 4) and five
states (s = 5). So we will build a table with four columns and
five rows.
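
A minimal sketch of this initialisation, using plain Python lists (a NumPy array would work equally well). The constant names are chosen here for illustration.

```python
N_STATES = 5   # rows, from the robot example
N_ACTIONS = 4  # columns: up, down, left, right

# Build each row separately so the rows do not share one list object.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
```

Building the rows in a comprehension matters: `[[0.0] * 4] * 5` would repeat the same row object five times, so updating one state would silently change all of them.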

Steps 2 and 3: choose and perform an action

This combination of steps is repeated for an indefinite amount
of time. That is, it runs until we stop the training, or until
the training loop stops as defined in the code.

We will choose an action (a) in the state (s) based on the
Q-Table. But, as mentioned earlier, when the episode initially
starts, every Q-value is 0.

So now the concept of the exploration and exploitation
trade-off comes into play. This article has more details.

We’ll use something called the epsilon-greedy strategy.

In the beginning, the epsilon rate will be high. The robot
will explore the environment and randomly choose actions.
The logic behind this is that the robot does not know anything
about the environment.

As the robot explores the environment, the epsilon rate
decreases and the robot starts to exploit the environment.

During the process of exploration, the robot progressively
becomes more confident in estimating the Q-values.
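
The epsilon-greedy strategy described above can be sketched in a few lines. The function names, the multiplicative decay, and the default values (`rate=0.99`, `minimum=0.05`) are assumptions for illustration; the text does not prescribe a particular schedule.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row using epsilon-greedy."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: the action with the highest current Q-value
    # (ties go to the lowest index).
    return max(range(len(q_row)), key=lambda a: q_row[a])


def decay_epsilon(epsilon, rate=0.99, minimum=0.05):
    """Shrink epsilon after each episode, but never below a floor."""
    return max(minimum, epsilon * rate)
```

Keeping a small floor on epsilon means the robot never stops exploring entirely, which is one common design choice among several.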

For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now —
our robot knows nothing about the environment. So the robot
chooses a random action, say right.

We can now update the Q-values for being at the start and
moving right using the Bellman equation.

Steps 4 and 5: evaluate

Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s,a).
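
The update is usually written as the rule below. This is the standard Q-learning form, stated here as an assumption: α (the learning rate), γ (the discount factor), r (the reward just observed) and s' (the next state) are symbols not defined in the text.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

With α = 1 the old estimate is overwritten by the new target; a smaller α blends the old estimate with the new one.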
In the case of the robot game, to reiterate, the scoring/reward
structure is:

• each step = -1
• power = +1
• mine = -100
• end = +100

We will repeat this again and again until the learning is
stopped. In this way the Q-Table will be updated.
Python implementation of Q-Learning
The concept and code implementation are explained in my
video.
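
The five steps above can be put together in a short, self-contained sketch. It is not the code from the video: to stay small it assumes a straight corridor of five tiles with only two actions (left and right) and no mines, and the values of `alpha`, `gamma` and `epsilon` are illustrative choices.

```python
import random

N_STATES = 5            # tiles 0..4 in a straight corridor (assumed layout)
GOAL = N_STATES - 1     # the end point is the last tile
LEFT, RIGHT = 0, 1      # only two actions in this simplified world


def step(state, action):
    """Move one tile and return (next_state, reward, done)."""
    move = -1 if action == LEFT else 1
    next_state = min(max(state + move, 0), GOAL)  # walls stop the robot
    done = next_state == GOAL
    reward = -1 + (100 if done else 0)  # step cost, plus the goal bonus
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # Step 1: all zeros
    for _ in range(episodes):
        state = 0
        for _ in range(100):                    # cap the episode length
            # Steps 2 and 3: choose (epsilon-greedy) and perform an action.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = max((LEFT, RIGHT), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Steps 4 and 5: observe the reward and update Q(s, a).
            future = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train()
greedy_policy = [max((LEFT, RIGHT), key=lambda a: row[a]) for row in q_table[:GOAL]]
```

After training, `greedy_policy` lists the preferred action for each non-goal tile, and the row for the goal tile stays at zero because no action is ever taken from it.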
At last…let us recap
• Q-Learning is a value-based reinforcement learning algorithm
  which is used to find the optimal action-selection policy using
  a Q function.
• Our goal is to maximize the value function Q.
• The Q table helps us to find the best action for each state.
• It helps to maximize the expected reward by selecting the best
  of all possible actions.
• Q(state, action) returns the expected future reward of that
  action at that state.
• This function can be estimated using Q-Learning, which
  iteratively updates Q(s,a) using the Bellman equation.
• Initially we explore the environment and update the Q-Table.
  When the Q-Table is ready, the agent will start to exploit the
  environment and start taking better actions.

SARSA is an on-policy algorithm where, in the current state S,
an action A is taken, the agent gets a reward R, ends up in the
next state S1, and takes action A1 in S1. The tuple
(S, A, R, S1, A1) gives the algorithm its name, SARSA.

It is called an on-policy algorithm because it updates its
Q-values using the actions actually chosen by the policy it is
following.

SARSA vs Q-learning
The difference between these two algorithms is that SARSA
updates its Q-values using the action chosen by the same policy
it is currently following, whereas Q-learning updates its
Q-values using the greedy action, that is, the action that gives
the maximum Q-value in the next state, regardless of which
action the agent actually takes next.

Basically, in SARSA the Q-value is updated taking into account
the action A1 actually performed in the state S1, as opposed to
Q-learning, where the action with the highest Q-value in the
next state S1 is used to update the Q-table.
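
The difference comes down to one term in the update target, as this small sketch shows. The function names, the Q-values for S1, and the reward and discount factor (`gamma`) are all made up for illustration.

```python
def q_learning_target(reward, next_q_row, gamma):
    """Off-policy target: assume the best action is taken in S1."""
    return reward + gamma * max(next_q_row)


def sarsa_target(reward, next_q_row, next_action, gamma):
    """On-policy target: use the action A1 actually chosen in S1."""
    return reward + gamma * next_q_row[next_action]


next_q_row = [2.0, 10.0, -5.0, 0.0]   # made-up Q-values for state S1
reward, gamma = -1.0, 0.9

# Suppose the behaviour policy explored and picked action 2 in S1.
off_policy = q_learning_target(reward, next_q_row, gamma)   # built from 10.0
on_policy = sarsa_target(reward, next_q_row, 2, gamma)      # built from -5.0
```

When the agent happens to pick the greedy action in S1, the two targets are identical; they differ only on exploratory steps.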

Common questions

In the context of Q-Learning, why is the Q-Table initialized with zeros, and how does updating occur?

The Q-Table is initialized with zeros in Q-Learning because the agent starts with no prior information about the environment or expected rewards. As the agent interacts with the environment, it updates the Q-Values iteratively using the Bellman equation, which considers the immediate reward received and the maximum expected future rewards. This process updates the Q-values towards more accurate estimations based on the actions taken and observed outcomes, thereby improving the agent's policy over time.

What are the potential challenges of implementing Q-Learning in a real-world robotic navigation task involving a maze with hazards like mines?

Implementing Q-Learning in real-world robotic navigation faces several challenges despite its theoretical robustness. One major issue is ensuring sufficient exploration without risking frequent collisions with hazards like mines, which may require carefully tuning exploration parameters such as the epsilon in epsilon-greedy strategies. Additionally, the computational and memory demands for maintaining and updating the Q-table could be significant, especially in environments with vast state-action spaces. Real-time constraints and the need for adaptive responses further complicate this implementation, along with issues of sensor reliability and environment model inaccuracies, which must be managed to realize practical deployments.

Explain the significance of the Bellman equation in Q-Learning.

The Bellman equation is integral to Q-Learning as it provides a recursive method to compute the expected return of choosing an action at a given state. It allows the agent to update its Q-values based on immediate rewards and the estimated future value of the resulting state, serving as the backbone for optimizing the policy. The equation enables breaking down the problem of finding optimal actions into smaller, solvable units, providing a structured approach to value iteration and policy improvement. Its iterative nature refines the estimates through repeated interactions with the environment; in the tabular case, with enough exploration and a suitably decaying learning rate, the estimates converge towards the optimal values.

How does the Q-Learning algorithm determine the best action at each state in an environment with a reward system?

The Q-Learning algorithm determines the best action at each state by using a Q-table, which is a lookup table containing the maximum expected future rewards for actions at each state. It uses the Bellman equation to iteratively update the Q-values in the table. Initially, the Q-values are zero, and they are updated based on the rewards received for the actions taken. The algorithm employs an exploration-exploitation trade-off strategy, initially exploring the environment with random actions to learn, and gradually shifting to exploiting learned actions that maximize expected rewards.

Discuss how reward shaping can influence Q-Learning in a structured environment like a maze.

Reward shaping can significantly impact Q-Learning by adjusting the structure and magnitude of rewards to guide the agent's learning process. In a maze environment, incorporating incremental rewards for approaching the goal and penalties for specific actions or hazards can accelerate learning and ensure more desirable behavior paths. It reduces the sparse reward problem, providing continuous feedback that assists the agent in understanding which state-action pairs contribute positively to achieving the goal. However, care must be taken to maintain consistency between shaped and original rewards to prevent the development of unintended strategies or convergence issues.

How can Q-Learning handle continuous spaces of states and actions typically found in complex systems?

Q-Learning traditionally deals with discrete state-action spaces. When the state space is continuous or too large for a table, function approximation methods such as neural networks can be used in place of the Q-table to estimate Q-values. This adaptation, known as a Deep Q-Network (DQN), uses deep learning to approximate the Q-value function, which lets Q-Learning generalize across states it has never visited. A standard DQN still assumes a discrete set of actions, because it outputs one Q-value per action; continuous action spaces generally require discretizing the actions or switching to a related method such as an actor-critic algorithm.

What are the distinct characteristics and applications of an off-policy algorithm like Q-Learning compared to an on-policy algorithm like SARSA?

Off-policy algorithms, such as Q-Learning, can use one policy to select actions and a different one to evaluate them. The agent updates its Q-values toward the greedy (currently best-looking) action, regardless of the policy actually used to act, so it learns about the optimal policy even while exploring. In contrast, on-policy algorithms like SARSA update Q-values using the actions the current policy actually takes, exploration included. This tends to produce more cautious policies when exploratory mistakes are costly, for example near hazards. Q-Learning is typically applied where the environment model is unknown and optimal final performance is desired, while SARSA is better suited to settings where performance during training matters, such as those with safety considerations.

How does the epsilon-greedy strategy facilitate the learning process in Q-Learning, and why is it important?

The epsilon-greedy strategy facilitates the learning process in Q-Learning by balancing exploration and exploitation. It allows the agent to select random actions with probability epsilon, encouraging exploration of unknown states, while choosing the best-known action with probability (1-epsilon), enabling exploitation. This strategy is critical for preventing early convergence to suboptimal policies by ensuring sufficient exploration of the environment, which helps in overcoming local optima and learning more about the overall landscape of possible actions and states. As learning progresses, epsilon is reduced to prioritize exploitation of the best-known actions, refining the agent's policy for optimal performance.

What role does the exploration-exploitation trade-off play in the Q-Learning process?

In the Q-Learning process, the exploration-exploitation trade-off is crucial for balancing learning and performance. Initially, when all Q-values are zero and the agent is unfamiliar with the environment, exploration is emphasized to discover new states and actions, using random actions guided by a high epsilon rate. Over time, as the agent learns which actions yield higher rewards, it reduces the exploration rate, shifting towards exploitation of the learned policy to maximize rewards by selecting actions with the highest Q-values. This trade-off ensures that the agent sufficiently explores the environment to learn effective strategies while exploiting known knowledge to achieve high performance.

Compare and contrast Q-Learning and SARSA in terms of policy update and action selection mechanisms.

Q-Learning and SARSA differ in their policy update methods and action selection mechanisms. Q-Learning is an off-policy algorithm that updates the Q-values using the action with the maximum Q-value in the next state, so it learns about the greedy policy even during training, irrespective of the policy actually followed. SARSA, on the other hand, is an on-policy algorithm where the Q-values are updated based on actions taken following the current policy, represented by the five-element tuple (S, A, R, S1, A1). Q-Learning learns the optimal values directly because its update always assumes the best next move, whereas SARSA might adapt better to environments where the policy needs to incorporate safety or risk considerations, because its updates account for the exploratory actions the policy actually takes.

