Module 3: RL Algorithms
• REINFORCEMENT ALGORITHMS AND APPLICATIONS:
• Algorithms for control learning, Q-learning,
• Discrete action space:
• SARSA – Lambda,
• DQN- Deep Q Network.
• Continuous action space:
• Deep Deterministic Policy Gradient (DDPG),
• Asynchronous Advantage Actor-Critic Algorithm (A3C).
• Multi-Armed Bandit Problem
• The multi-armed bandit problem is used in reinforcement learning to
formalize the notion of decision-making under uncertainty. In a
multi-armed bandit problem, an agent (learner) chooses
between k different actions and receives a reward based on the chosen
action.
• Multi-armed bandits are also used to describe fundamental
concepts in reinforcement learning, such as rewards, timesteps,
and values.
• The Multi-Armed Bandit (MAB) problem is a classic problem in
probability theory and decision-making that captures the essence of
balancing exploration and exploitation.
• This problem is named after the scenario of a gambler facing multiple
slot machines (bandits) and needing to determine which machine to
play to maximize their rewards. The MAB problem has significant
applications in various fields, including online advertising, clinical
trials, adaptive routing in networks, and more.
• The exploration vs. exploitation dilemma exists in many aspects of
life.
• The dilemma comes from incomplete information.
• For selecting an action by an agent, we assume that each action has a
separate distribution of rewards and there is at least one action that
generates maximum numerical reward.
• Thus, the probability distribution of the rewards corresponding to each
action is different and is unknown to the agent (decision-maker).
• Hence, the goal of the agent is to identify which action to choose to
get the maximum reward after a given set of trials.
• Imagine you are in a casino facing multiple slot machines, each
configured with an unknown probability of paying out a
reward on a single play. The question is: what is the best strategy to
achieve the highest long-term reward?
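ε-Greedy Algorithm
• Epsilon (ε) is the probability of choosing to explore.
• The agent exploits the best-known action most of the time
(with probability 1 − ε) and explores a random action with a
small probability ε.
• The ε-greedy algorithm is one way to solve the multi-armed bandit (MAB) problem.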
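The ε-greedy rule can be sketched on a small simulated bandit. This is a minimal illustration, not a reference implementation: the function name `epsilon_greedy_bandit`, the Gaussian reward model, the arm means, and the default parameters are all assumptions chosen for the example.

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy on a simulated k-armed bandit.

    true_means are hypothetical arm means; rewards are drawn from a
    Gaussian with standard deviation 1 around the chosen arm's mean.
    Returns the estimated value and the pull count of every arm.
    """
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k   # current value estimate of each arm
    n = [0] * k     # number of times each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)                    # explore: random arm
        else:
            action = max(range(k), key=lambda i: q[i])   # exploit: best estimate
        reward = rng.gauss(true_means[action], 1.0)
        n[action] += 1
        # Incremental sample average: Q <- Q + (R - Q) / N
        q[action] += (reward - q[action]) / n[action]
    return q, n

estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.5])
best_arm = max(range(len(estimates)), key=lambda i: estimates[i])
print("estimated values:", [round(v, 2) for v in estimates])
print("pull counts:", counts, "best arm:", best_arm)
```

With a small ε the agent spends most pulls on the arm it currently believes is best, while the occasional random pull keeps the estimates of the other arms from going stale.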
• Implementation work // Practise
• [Link]
• [Multi-Armed Bandit Problem and Epsilon-Greedy Action Value
Method in Python: Reinforcement Learning]
• [Introduction to OpenAI Gym and Frozen Lake Environment in
Python- Reinforcement Learning Tutorial]
• [Link]
• Intro to OpenAI Gym (Frozen Lake environment):
[Link]
• Intro to OpenAI Gym (Cart Pole environment):
[Link]
• Intro to OpenAI Gym (Atari Game environment):
[Link]
• Classification of RL Algo
Control Algorithms
• A control algorithm is an algorithm used to control, coordinate, and optimize a system.
• The most common algorithms can be classified as fuzzy-based
algorithms, neural networks-based algorithms, predictive control
algorithms, machine learning algorithms, and combined control
algorithms. For example, Pappis and Mamdani (1977) developed the
first fuzzy-based control algorithm for traffic junctions.
• GOTO is a control algorithm in which the master robot commands the
slave robot to drive itself from its current operational position to
specified positions.
• Urban Traffic Management System Based on Ontology and
Multiagent System
• Dong Shen and Songhang Chen State Key Laboratory of Management
and Control for Complex Systems, Institute of Automation, Chinese
Academy of Sciences, Beijing, China
What is a reinforcement learning problem and why are controls important for them?
• Reinforcement learning is a problem where an agent tries to learn how
to maximize its reward by interacting with its environment.
• Decisions, also known as controls, are important for reinforcement
learning problems because they help agents learn which control in a
given state is associated with the greatest rewards.
• Many different reinforcement learning algorithms help discover
the best controls (those that lead to the greatest rewards).
• individual weaknesses:
• Reinforcement learning algorithms can encourage exploration early
on during an agent’s training to avoid getting stuck in local optima or
suboptimal policies.
• In addition, reinforcement learning control algorithms can be used to
help an agent recover from mistakes and return to the task at hand.
Q learning
• Q-learning is a model-free, off-policy reinforcement learning algorithm that
finds the best course of action given the current state of the agent.
Depending on where the agent is in the environment, it decides the
next action to take.
• To do this, it may learn from actions taken outside the policy it is
currently following, such as random exploratory actions. Because the policy
it learns about can differ from the policy that generates its behaviour,
we call it off-policy.
• Model-free means that the agent does not use predictions of the environment’s
expected response to move forward. It learns from the rewards it receives
through trial and error, without a model of the environment.
• "model-free" means an algorithm learns by interacting
directly with the environment without needing to build a
predictive model of how the environment works, relying on
trial and error to discover optimal actions.
•"off-policy" means the algorithm can learn from data
generated by a different policy (behavior policy) than the one
it is currently trying to optimize (target policy)
Using Q-learning, we can optimize the ad
recommendation system to recommend products that
are frequently bought together. The reward will be if the
user clicks on the suggested product.
Activity
• Case study of Q learning based Ad Recommendation systems
• Sequence Modelling: Aviation Industry
• Design of Q learning
Important Terms in Q-Learning
States: The State, S, represents the current position of an agent in an
environment.
Action: The Action, A, is the step taken by the agent when it is in a
particular state.
Rewards: For every action, the agent will get a positive or negative
reward.
Episodes: An episode ends when the agent reaches a terminating state and can’t take a
new action.
Q-Values: Used to determine how good an Action, A, taken at a
particular state, S, is. Written Q(S, A).
Temporal Difference: A formula used to update the Q-Value by comparing the
value of the current state and action with the reward received plus the
value of the next state and action.
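The temporal-difference idea can be made concrete with a single Q-learning update. This is a small sketch: the function name `td_update`, the learning rate `alpha`, the discount factor `gamma`, and the numbers used are assumptions for illustration, not values from the slides.

```python
def td_update(q_sa, reward, max_q_next, alpha=0.5, gamma=0.9):
    """Apply one Q-learning temporal-difference update to a single Q-value.

    q_sa        current estimate Q(S, A)
    reward      reward received after taking A in S
    max_q_next  largest Q-value available in the next state
    """
    td_target = reward + gamma * max_q_next   # what the estimate should move toward
    td_error = td_target - q_sa               # the temporal difference
    return q_sa + alpha * td_error            # move part of the way toward the target

# Hypothetical numbers: Q(S, A) = 0, reward = 1, best next Q-value = 2
new_q = td_update(q_sa=0.0, reward=1.0, max_q_next=2.0)
print(new_q)  # approximately 1.4 = 0 + 0.5 * (1 + 0.9 * 2 - 0)
```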
Bellman Equation?
• The Bellman Equation is used to determine the value of a particular
state and deduce how good it is to be in that state. The optimal
state is the one with the highest value.
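For reference, the Bellman optimality equations are commonly written in the form below. The symbols are standard textbook notation rather than notation defined on these slides: $\gamma$ is the discount factor, $R_{t+1}$ the reward after acting at time $t$, and $a'$ an action available in the next state.

```latex
V^{*}(s) = \max_{a} Q^{*}(s, a)

Q^{*}(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right]
```

The second line says that the value of taking action $a$ in state $s$ equals the immediate reward plus the discounted value of acting optimally afterwards.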
How to Make a Q-Table?
• While running our algorithm, we will come across various solutions
and the agent will take multiple paths. How do we find out the best
among them? This is done by tabulating our findings in a table called a
Q-Table.
• A Q-Table helps us to find the best action for each state in the
environment. We use the Bellman Equation at each state to get the
expected future state and reward and save it in a table to compare with
other states.
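A Q-table can be represented as a simple two-dimensional structure with one row per state and one column per action. The sketch below uses plain Python lists; the table sizes, the filled-in values, and the helper name `best_action` are hypothetical and only serve the illustration.

```python
n_states, n_actions = 4, 2   # hypothetical sizes for illustration

# Q-table: one row per state, one column per action, initialised to zero
q_table = [[0.0 for _ in range(n_actions)] for _ in range(n_states)]

# Pretend that learning has already filled in the first two rows
q_table[0] = [0.1, 0.7]
q_table[1] = [0.4, 0.2]

def best_action(q_table, state):
    """Return the index of the action with the largest Q-value in this state."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])

# Reading the best action off every row gives the greedy policy
greedy_policy = [best_action(q_table, s) for s in range(n_states)]
print(greedy_policy)  # [1, 0, 0, 0]; rows that are still all zero fall back to action 0
```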
AWS ROBOMAKER
[Link]
• Numerical on Q learning
• Q-Learning ( Start State, Policy, Table and function, Goal state)
•{
• Step 1: Decide the policy π and its Q-function $Q^{\pi}(s_t, a_t)$ // s_t and a_t represent the state and action at time t
• Step 2: The agent uses the policy and, based on its observation, selects the best action.
• Step 3: The agent receives an immediate positive reward for a correct action and a negative reward for a bad action.
• Step 4: Q-learning depends on the Q-function and the Q-table.
• Step 5: The Q-function uses the Bellman equation // Generalized form of policy
• $Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} \right]$, where $\gamma$ is the discount factor
• Step 6: Maintain the Q-table.
• Step 7: Repeat from the initial state (Step 1) until the agent reaches the goal state.
• }
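The steps above can be sketched end to end on a tiny problem. Everything specific here is an assumption made for the example: a five-state corridor with the goal at the right end, a reward of 1 for reaching the goal, and the hyperparameters `alpha`, `gamma`, and `epsilon`.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = [0, 1]           # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic transition; returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]      # Step 6: the Q-table
    for _ in range(episodes):
        state, done = 0, False                     # start state
        while not done:                            # Step 7: repeat until the goal
            # Step 2: epsilon-greedy action selection with random tie-breaking
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)   # Step 3: reward
            # Step 5: Bellman-style target using the greedy next value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
print(policy)  # greedy action for states 0..3
```

In this corridor the learned Q-values grow as the state gets closer to the goal, so the greedy policy read from the table moves right in every non-goal state.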
Class Activity _ 17/02/2025
1. Implement the grid-world agent navigation problem using
❖ Q-learning and SARSA
❖ Compare both results
SARSA Reinforcement Learning
• The SARSA algorithm is a slight variation of the popular Q-Learning
algorithm. For a learning agent in any Reinforcement Learning
algorithm, its policy can be of two types:
1. On Policy: In this, the learning agent learns the value function
according to the current action derived from the policy currently being
used.
2. Off Policy: In this, the learning agent learns the value function
according to the action derived from another policy.
• The Q-Learning technique is an Off Policy technique and uses the greedy
approach to learn the Q-value.
• The SARSA technique, on the other hand, is an On Policy technique and uses the
action performed by the current policy to learn the Q-value.
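The difference between the two techniques shows up in the update target. The sketch below compares them on one made-up transition; the function names, the discount factor `gamma`, and the Q-values are assumptions for illustration.

```python
def q_learning_target(reward, q_next, gamma=0.9):
    """Off-policy target: bootstrap from the greedy (maximum) next Q-value."""
    return reward + gamma * max(q_next)

def sarsa_target(reward, q_next, next_action, gamma=0.9):
    """On-policy target: bootstrap from the action the current policy actually chose."""
    return reward + gamma * q_next[next_action]

q_next = [0.2, 1.0]      # hypothetical Q-values of the next state
explored_action = 0      # suppose the epsilon-greedy policy explored here

off_policy = q_learning_target(0.0, q_next)             # uses max(q_next) = 1.0
on_policy = sarsa_target(0.0, q_next, explored_action)  # uses q_next[0] = 0.2
print(off_policy, on_policy)
```

When the next action happens to be the greedy one, both targets coincide; they differ only on steps where the behaviour policy explores.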
• SARSA (State-Action-Reward-State-Action) is a reinforcement
learning algorithm that is used to learn a policy for an agent interacting
with an environment.
• It is a type of on-policy algorithm, which means that it learns the value
function and the policy based on the actions that are actually taken by
the agent.
• The SARSA algorithm works by maintaining a table of action-value
estimates Q(s, a), where s is the state and a is the action taken by the
agent in that state.
• The table is initialized to some arbitrary values, and the agent uses an
epsilon-greedy policy to select actions.
SARSA algorithm
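A minimal SARSA loop, using a Q-table and an epsilon-greedy policy as described above, might look like the following. The five-state corridor environment, the reward of 1 at the goal, and the hyperparameters are assumptions for this sketch and do not come from the FrozenLake or Taxi tutorials referenced on these slides.

```python
import random

N_STATES, GOAL = 5, 4      # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = [0, 1]           # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic transition; returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def sarsa(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # action-value table Q(s, a)
    for _ in range(episodes):
        state = 0
        action = choose_action(q, state, epsilon, rng)        # S, A
        done = False
        while not done:
            next_state, reward, done = step(state, action)    # R, S'
            next_action = choose_action(q, next_state, epsilon, rng)  # A'
            # On-policy target: uses the action that will actually be taken next
            target = reward + (0.0 if done else gamma * q[next_state][next_action])
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
    return q

sarsa_q = sarsa()
sarsa_policy = [max(ACTIONS, key=lambda a: sarsa_q[s][a]) for s in range(GOAL)]
print(sarsa_policy)
```

The name comes from the five quantities used in each update: S, A, R, S', A'. Because the target includes exploratory next actions, SARSA values are typically a little lower than the corresponding Q-learning values for the same environment.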
SARSA algo: FrozenLake-v0
• Reinforcement Learning in Python: Implementing SARSA Agent
in Taxi Environment
• [Link]
mplementing-sarsa-agent-in-taxi-environment/?ref=asr2