
Q-Learning Explained for Robotics

The document describes how Q-learning can be used to train a robot to navigate a maze and reach the goal point without stepping on mines. The robot's position in the maze is modeled as states in a Q-table, while possible movements are actions. Initially random, the robot learns over time to take advantageous actions by updating the Q-table values using rewards and penalties. Through iterative training, the robot exploits what it has learned to find the shortest path to the goal.

Uploaded by

Surya Mudaliar
Copyright
© All Rights Reserved

Q-Learning — a simplistic overview

Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile
at a time. If the robot steps onto a mine, the robot is dead.
The robot has to reach the end point in the shortest time
possible.
The scoring/reward system is as below:

1. The robot loses 1 point at each step. This is done so that the
robot takes the shortest path and reaches the goal as fast as
possible.
2. If the robot steps on a mine, the point loss is 100 and the game
ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.
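
The scoring rules above can be sketched as a small reward function. This is only an illustration: the tile labels ("empty", "power", "mine", "goal") are hypothetical names, and the sketch assumes that the special rewards replace the one-point step penalty rather than being added to it.

```python
# Hypothetical tile labels; the article does not name them.
TILE_REWARD = {
    "empty": -1,    # ordinary step: lose 1 point
    "power": 1,     # power tile: gain 1 point
    "mine": -100,   # mine: lose 100 points, game over
    "goal": 100,    # end goal: gain 100 points
}
TERMINAL_TILES = {"mine", "goal"}  # tiles that end the game (assumption for "goal")


def step_reward(tile):
    """Return (reward, done) for moving onto the given tile."""
    return TILE_REWARD[tile], tile in TERMINAL_TILES


print(step_reward("empty"))
print(step_reward("mine"))
```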
Now, the obvious question is: how do we train a robot to
reach the end goal by the shortest path without stepping on a
mine?

Introducing the Q-Table


A Q-Table is just a fancy name for a simple lookup table in which
we store the maximum expected future reward for each action
at each state. Basically, this table will guide us to the best
action at each state.

There are four possible actions at each non-edge tile: when the
robot is at a state, it can move up, down, left, or right.

So, let’s model this environment in our Q-Table.

In the Q-Table, the columns are the actions and the rows are
the states.
Each Q-table score will be the maximum expected future
reward that the robot will get if it takes that action at that
state. This is an iterative process, as we need to improve the
Q-Table at each iteration.
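
To make the row/column layout concrete, here is a minimal sketch of a Q-Table and of how it guides the robot to the best action. The state names, the column order and the numbers are all made up for illustration; they are not values from the robot example.

```python
ACTIONS = ["up", "down", "left", "right"]  # column order is an assumption

# Rows are states, columns are actions; the numbers are illustrative only.
q_table = {
    "start": [0.0, 0.5, -1.0, 1.2],
    "power": [0.3, 2.0, 0.1, -100.0],
}


def best_action(state):
    """Return the action with the highest Q-value in the given state."""
    row = q_table[state]
    return ACTIONS[row.index(max(row))]


print(best_action("start"))  # right
print(best_action("power"))  # down
```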

But the questions are:

• How do we calculate the values of the Q-table?
• Are the values available or predefined?
To learn each value of the Q-table, we use the Q-Learning
algorithm.
Mathematics: the Q-Learning algorithm
Q-function
The Q-function uses the Bellman equation and takes two
inputs: state (s) and action (a).
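
The equation itself does not appear in this text. As a reference point, one standard way to write the Q-function is as the expected discounted sum of future rewards, together with its Bellman form. The discount factor \gamma and the time index t are standard symbols and are not defined in the text above.

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid s_t, a_t \right]

Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \mid s_t = s,\ a_t = a \right]
```

The first line defines the value of taking action a in state s and then following a policy \pi. The second is the Bellman optimality equation, which is the relationship that Q-Learning tries to satisfy.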

Using the above function, we get the values of Q for the cells
in the table.
When we start, all the values in the Q-table are zeros.

There is an iterative process of updating the values. As we
start to explore the environment, the Q-function gives us
better and better approximations by continuously updating the
Q-values in the table.
Now, let’s understand how the updating takes place.

Introducing the Q-learning algorithm process


Each of the colored boxes is one step. Let’s understand each
of these steps in detail.

Step 1: initialize the Q-Table


We will first build a Q-table. There are n columns, where n =
number of actions. There are m rows, where m = number of
states. We will initialise the values at 0.
In our robot example, we have four actions (a = 4) and five
states (s = 5). So we will build a table with four columns and
five rows.
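
Step 1 can be sketched in a few lines of plain Python. The constant names are our own, and a list of lists is only one possible way to store the table (a NumPy array would work just as well).

```python
N_STATES = 5   # m rows: one per state in the robot example
N_ACTIONS = 4  # n columns: up, down, left, right

# Every Q-value starts at 0 because the robot knows nothing yet.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

for row in q_table:
    print(row)
```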

Steps 2 and 3: choose and perform an action


This combination of steps is repeated for an undefined amount of
time: it runs until we stop the training, or until the training
loop stops as defined in the code.

We will choose an action (a) in the state (s) based on the
Q-Table. But, as mentioned earlier, when the episode initially
starts, every Q-value is 0.
So now the concept of the exploration and exploitation trade-off
comes into play.
We’ll use something called the epsilon-greedy strategy.
In the beginning, the epsilon rate will be higher. The robot
will explore the environment and randomly choose actions.
The logic behind this is that the robot does not yet know anything
about the environment.
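
A minimal sketch of epsilon-greedy action selection follows. The function name and its parameters are hypothetical, and breaking ties at random is our own choice, made so that an all-zero row does not always produce the same action.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Pick an action index for one state using epsilon-greedy.

    q_row   -- list of Q-values for the current state, one per action
    epsilon -- probability of exploring (choosing uniformly at random)
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore: any action
    best = max(q_row)
    # Exploit: pick among the best-valued actions, breaking ties at random.
    best_actions = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)


print(choose_action([0.0, 2.0, 1.0, -1.0], epsilon=0.0))  # 1
```

With epsilon = 1 every choice is random, and with epsilon = 0 the robot always takes the action with the highest Q-value.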

As the robot explores the environment, the epsilon rate
decreases and the robot starts to exploit the environment.

During the process of exploration, the robot progressively
becomes more confident in estimating the Q-values.
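
One common way to lower the epsilon rate over time is an exponential decay per episode. The sketch below is only one possible schedule; the start value, the floor and the decay rate are assumed numbers, not values from the robot example.

```python
import math

EPS_START = 1.0    # assumed: fully random at the start
EPS_MIN = 0.01     # assumed floor so that some exploration always remains
DECAY_RATE = 0.01  # assumed decay speed per episode


def epsilon_at(episode):
    """Exponentially decay epsilon from EPS_START towards EPS_MIN."""
    return EPS_MIN + (EPS_START - EPS_MIN) * math.exp(-DECAY_RATE * episode)


for episode in (0, 100, 500):
    print(episode, round(epsilon_at(episode), 3))
```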

For the robot example, there are four actions to choose from:
up, down, left, and right. We are starting the training now —
our robot knows nothing about the environment. So the robot
chooses a random action, say right.

We can now update the Q-values for being at the start and
moving right using the Bellman equation.

Steps 4 and 5: evaluate


Now we have taken an action and observed an outcome and
reward. We need to update the function Q(s,a).
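
The update formula does not appear in this text. The standard Q-Learning update rule is usually written as below, where \alpha is the learning rate, \gamma is the discount factor, r is the reward just received and s' is the next state. These symbols are standard notation and are not defined in the text above.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The term in square brackets is the gap between the new estimate (reward plus discounted best future value) and the old Q-value, and \alpha controls how far we move towards the new estimate.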
In the case of the robot game, to reiterate, the scoring/reward
structure is:

• power = +1
• mine = -100
• end = +100

We will repeat this again and again until the learning is
stopped. In this way the Q-Table will be updated.
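
A single update can be sketched as follows, using the standard Q-Learning rule. The learning rate and discount factor are assumed values, the column index for "right" is hypothetical, and the worked example assumes the tile to the right of the start holds power (+1).

```python
ALPHA = 0.1  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def q_update(q_table, state, action, reward, next_state, done):
    """Apply one Q-Learning update in place and return the new Q-value."""
    # A terminal step (mine or end goal) has no future value.
    future = 0.0 if done else max(q_table[next_state])
    target = reward + GAMMA * future
    q_table[state][action] += ALPHA * (target - q_table[state][action])
    return q_table[state][action]


q_table = [[0.0] * 4 for _ in range(5)]
RIGHT = 3  # hypothetical column index for "right"

# The robot starts in state 0, moves right and lands on a power tile.
new_q = q_update(q_table, state=0, action=RIGHT, reward=1, next_state=1, done=False)
print(new_q)
```

Since every Q-value starts at 0, this first update is 0 + 0.1 * (1 + 0.9 * 0 - 0), so Q(start, right) moves from 0 to about 0.1.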
Python implementation of Q-Learning
The concept and code implementation are explained in my
video.
Subscribe to my YouTube channel for more AI videos: ADL.
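
For readers without access to the video, the steps above can be put together in a short training loop. This is our own sketch, not the implementation from the video: the 2x3 maze layout, the hyperparameters, the decay schedule and the rule that moving off the grid leaves the robot in place are all assumptions.

```python
import random

GRID = ["SPM",
        ".PG"]  # hypothetical layout: S start, P power, M mine, G end goal
REWARD = {"S": -1, ".": -1, "P": 1, "M": -100, "G": 100}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
ROWS, COLS = len(GRID), len(GRID[0])
ALPHA, GAMMA = 0.1, 0.9  # assumed learning rate and discount factor


def step(state, action):
    """Move one tile; a move off the grid leaves the robot in place."""
    r, c = divmod(state, COLS)
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS):
        nr, nc = r, c
    tile = GRID[nr][nc]
    return nr * COLS + nc, REWARD[tile], tile in "MG"


def greedy_action(q, state):
    """Index of the highest-valued action in this state."""
    return max(range(len(MOVES)), key=lambda a: q[state][a])


def train(episodes=2000, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(MOVES) for _ in range(ROWS * COLS)]  # step 1
    for episode in range(episodes):
        epsilon = max(0.05, 0.995 ** episode)  # decaying exploration rate
        state = 0  # the start tile is the top-left corner
        for _ in range(max_steps):
            # Steps 2 and 3: choose and perform an action (epsilon-greedy).
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = greedy_action(q, state)
            next_state, reward, done = step(state, action)
            # Steps 4 and 5: measure the reward and update the Q-Table.
            future = 0.0 if done else max(q[next_state])
            q[state][action] += ALPHA * (reward + GAMMA * future - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the best-known action from the start tile."""
    state, path = 0, [0]
    for _ in range(max_steps):
        state, _reward, done = step(state, greedy_action(q, state))
        path.append(state)
        if done:
            break
    return path


q = train()
print(greedy_path(q))
```

States are numbered row by row, so state 0 is the start and state 5 is the end goal. After training, following the highest Q-value from the start should lead the robot to the goal while avoiding the mine at state 2.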
At last…let us recap
• Q-Learning is a value-based reinforcement learning algorithm
which is used to find the optimal action-selection policy using
a Q function.
• Our goal is to maximize the value function Q.
• The Q table helps us to find the best action for each state.
• It helps to maximize the expected reward by selecting the best
of all possible actions.
• Q(state, action) returns the expected future reward of that
action at that state.
• This function can be estimated using Q-Learning, which
iteratively updates Q(s,a) using the Bellman equation.
• Initially we explore the environment and update the Q-Table.
When the Q-Table is ready, the agent will start to exploit the
environment and start taking better actions.
SARSA is an on-policy algorithm where, in the current state S,
an action A is taken, the agent gets a reward R and ends up
in the next state S1, and then takes action A1 in S1. The tuple
(S, A, R, S1, A1) therefore gives the algorithm its name, SARSA.

It is called an on-policy algorithm because it updates its Q-values
based on the actions actually taken under the current policy.

SARSA vs Q-learning
The difference between these two algorithms is
that SARSA updates its Q-values using the action chosen by the
same policy it is currently following, whereas Q-learning updates
them using the greedy action, that is, the action that gives the
maximum Q-value for the next state, regardless of which action
the agent actually takes next.

Basically, in SARSA the Q-value is updated taking into account the
action A1 actually performed in state S1, as opposed to
Q-learning, where the action with the highest Q-value in the next
state S1 is used to update the Q-table.
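
The difference shows up in a single line of the update target. The sketch below compares the two targets on a toy Q-table; the function names, the discount factor and the numbers are made up for illustration.

```python
GAMMA = 0.9  # discount factor (assumed value)


def q_learning_target(q_table, reward, next_state):
    """Off-policy target: uses the best action available in S1."""
    return reward + GAMMA * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action):
    """On-policy target: uses the action A1 the agent actually takes in S1."""
    return reward + GAMMA * q_table[next_state][next_action]


# Toy Q-table with two states and two actions (illustrative numbers).
q_table = [[0.0, 0.0],
           [5.0, -2.0]]

print(q_learning_target(q_table, reward=-1, next_state=1))
print(sarsa_target(q_table, reward=-1, next_state=1, next_action=1))
```

Here Q-learning assumes the best action will be taken in S1 and gets a target of about 3.5, while SARSA uses the exploratory action A1 = 1 that was actually chosen and gets about -2.8. When A1 happens to be the greedy action, the two targets are the same.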
