Reinforcement Learning Notes
UNIT-I: Reinforcement Learning Foundation
1. Introduction to Reinforcement Learning (RL)
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with
an environment. It learns by receiving rewards or punishments for actions it takes. The goal is to maximize the cumulative
reward over time.
Example: A robot learning to walk. It receives positive rewards for steps that help it move forward and negative rewards if it
falls.
2. Key Terms in Reinforcement Learning
- Agent: The entity making decisions. Example: A self-driving car navigating through traffic.
- Environment: The world the agent interacts with. Example: The roads, traffic lights, and other cars for a self-driving
vehicle.
- State (S): A representation of the current situation in the environment. Example: The robot’s current position in a room.
- Action (A): The possible decisions or moves the agent can take. Example: Moving left, right, forward, or backward.
- Reward (R): Feedback from the environment that tells the agent how good or bad its action was. Example: +10 for moving
toward a goal, -10 for colliding with an obstacle.
- Policy (π): The strategy used by the agent to choose actions based on the current state. Example: In a maze, the agent might
have a policy to always turn right.
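The terms above fit together in a single interaction loop. The following is a minimal sketch, assuming a hypothetical one-dimensional corridor environment; the class `CorridorEnv`, its reward values, and the always-move-right policy are illustrative assumptions and do not come from these notes.

```python
class CorridorEnv:
    """Hypothetical environment: positions 0..4, with the goal at position 4."""

    def __init__(self):
        self.state = 0  # State: the agent's current position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 10 if done else -1  # Reward: +10 at the goal, -1 for any other step
        return self.state, reward, done


def policy(state):
    """Policy: a fixed rule that maps every state to the action 'move right'."""
    return +1


def run_episode(env, policy, max_steps=20):
    """Agent-environment loop: observe the state, act, receive a reward, repeat."""
    total_reward = 0
    state = env.state
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


total = run_episode(CorridorEnv(), policy)
print(total)  # three steps at -1 each, then +10 at the goal: 7
```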
3. Features and Elements of RL
- Exploration vs. Exploitation: Exploration involves trying new actions to discover unknown rewards, while exploitation
uses known actions to maximize the reward.
Example: An agent facing a row of slot machines must decide whether to try a machine it has not played yet (exploration)
or stick with the one that has been paying well (exploitation).
- Delayed Rewards: Rewards are not always immediate; actions may affect future rewards.
Example: A robot might move away from an immediate reward to reach a higher reward later.
4. Markov Decision Process (MDP)
An MDP is a mathematical framework to describe the RL problem. It consists of:
- S: Set of states
- A: Set of actions
- P: Transition probabilities, P(s'|s,a) — the probability of transitioning to state s' given action a in state s.
- R: Reward function
- γ (Gamma): Discount factor, representing how much future rewards are valued.
Example (Numerical): Two states S1 and S2. Two actions A1 and A2. Transition probabilities: P(S2|S1, A1) = 0.8, P(S1|S1,
A1) = 0.2. Reward: R(S1, A1) = 5, R(S2, A2) = 10.
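One way to make the example concrete is to store the given numbers in dictionaries, sample a next state, and compute a discounted return. This is only a sketch: the notes list just the entries shown, so nothing else is filled in, and the trajectory and the value γ = 0.9 used at the end are assumptions for illustration.

```python
import numpy as np

# Transition probabilities given in the example: P(s' | s, a)
P = {("S1", "A1"): {"S2": 0.8, "S1": 0.2}}
# Rewards given in the example: R(s, a)
R = {("S1", "A1"): 5, ("S2", "A2"): 10}


def sample_next_state(state, action, rng):
    """Draw the next state s' from P(. | state, action)."""
    next_states = list(P[(state, action)].keys())
    probs = list(P[(state, action)].values())
    return str(rng.choice(next_states, p=probs))


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rng = np.random.default_rng(0)  # seed chosen arbitrarily
samples = [sample_next_state("S1", "A1", rng) for _ in range(5000)]
frac_s2 = samples.count("S2") / len(samples)  # should be close to 0.8

# Assumed trajectory: A1 in S1 (reward 5), then A2 in S2 (reward 10), with gamma = 0.9
G = discounted_return([R[("S1", "A1")], R[("S2", "A2")]], gamma=0.9)  # 5 + 0.9 * 10
```

With a smaller γ the second reward counts for less, which is how the discount factor expresses a preference for earlier rewards.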
5. Policies and Value Functions
- Policy (π): A mapping from states to actions, where π(s) defines the action the agent will take in state s.
- State-Value Function V(s): The expected cumulative reward starting from state s following policy π.
Example: In a grid world, V(s) could represent the value of being at a particular cell based on future rewards.
- Action-Value Function Q(s,a): The expected cumulative reward starting from state s, taking action a, and then following
policy π.
Example: In a maze, Q(s,a) could represent the value of moving left or right from a specific position.
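The two value functions are linked: under a policy π, V(s) is the policy-weighted average of Q(s,a), that is V(s) = Σ_a π(a|s) Q(s,a). The sketch below checks this on a small table; the Q-values and the policy probabilities are made-up numbers, not results from the notes.

```python
import numpy as np

# Hypothetical Q-table: rows are two maze positions, columns are [left, right]
Q = np.array([[1.0, 3.0],
              [0.5, 0.0]])
# Hypothetical stochastic policy pi(a|s); each row sums to 1
pi = np.array([[0.25, 0.75],
               [1.00, 0.00]])

V = (pi * Q).sum(axis=1)   # V(s) = sum_a pi(a|s) * Q(s, a)
greedy = Q.argmax(axis=1)  # Action with the highest Q-value in each state

print(V)       # values of the two positions under pi
print(greedy)  # 1 = right in the first position, 0 = left in the second
```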
6. Bellman Equations
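The Bellman equations give recursive definitions of V(s) and Q(s,a). The forms below are the Bellman optimality equations,
which assume the agent acts optimally from the next step onward (hence the max over actions):
- State-Value Bellman Equation: V(s) = max_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]
- Action-Value Bellman Equation: Q(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) max_{a'} Q(s',a')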
Example (Numerical): Let’s say we are in state S1, and can move to S2 with a reward of 10. Using the Bellman equation:
V(S1) = max {10 + γ V(S2)}
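Applying the state-value Bellman equation repeatedly until the values stop changing is known as value iteration. The sketch below does this for a tiny MDP built around the example: the move from S1 to S2 with reward 10 comes from the notes, while the extra "stay" action, the absorbing zero-reward S2, and γ = 0.9 are assumptions added so that the MDP is fully specified.

```python
import numpy as np


def value_iteration(P, R, gamma, tol=1e-8, max_iters=1000):
    """P[s, a, s'] holds transition probabilities, R[s, a] holds rewards."""
    V = np.zeros(P.shape[0])
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        Q = R + gamma * P @ V          # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)          # V(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
    return V, Q


# States: 0 = S1, 1 = S2. Actions: 0 = stay (assumed), 1 = move.
P = np.array([
    [[1.0, 0.0],   # S1, stay -> S1
     [0.0, 1.0]],  # S1, move -> S2
    [[0.0, 1.0],   # S2 is assumed absorbing under both actions
     [0.0, 1.0]],
])
R = np.array([[0.0, 10.0],   # moving from S1 to S2 pays 10
              [0.0, 0.0]])   # S2 pays nothing further (assumed)

V, Q = value_iteration(P, R, gamma=0.9)
print(V)  # V(S1) = 10 + 0.9 * V(S2) = 10, V(S2) = 0
```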
7. Exploration vs. Exploitation
Epsilon-Greedy Strategy: This is a popular way to balance exploration and exploitation. The agent chooses a random action
with probability ε (exploration) and the best-known action with probability 1-ε (exploitation).
Example (Numerical): Consider a bandit problem: Two slot machines: Machine 1 gives a reward of 10 with probability 0.8,
Machine 2 gives a reward of 10 with probability 0.5. Using epsilon-greedy with ε = 0.1, the agent will explore 10% of the
time and exploit 90% of the time.
Code Example:
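import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Pick an action index for `state` from the Q-table using epsilon-greedy selection."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q[state]))  # Explore: random action
    else:
        return np.argmax(Q[state])  # Exploit: choose best-known action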
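The strategy can be tried on the two-machine example. The sketch below is a self-contained simulation under stated assumptions: reward estimates start at zero and are updated as running averages, and the seed and the number of steps are chosen arbitrarily.

```python
import numpy as np


def run_bandit(win_probs, payout=10.0, epsilon=0.1, steps=5000, seed=0):
    """Play a multi-armed bandit with epsilon-greedy action selection."""
    rng = np.random.default_rng(seed)
    n_arms = len(win_probs)
    Q = np.zeros(n_arms)                 # Running estimate of each machine's average reward
    pulls = np.zeros(n_arms, dtype=int)  # How often each machine was played
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))  # Explore: random machine
        else:
            arm = int(np.argmax(Q))          # Exploit: best estimate so far
        reward = payout if rng.random() < win_probs[arm] else 0.0
        pulls[arm] += 1
        Q[arm] += (reward - Q[arm]) / pulls[arm]  # Incremental mean update
    return Q, pulls


# Machine 1 pays 10 with probability 0.8, Machine 2 with probability 0.5
Q, pulls = run_bandit([0.8, 0.5])
print(Q)      # estimates should approach the true averages of 8 and 5
print(pulls)  # most plays should go to Machine 1
```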
8. Code Standards and Libraries used in RL
Python: The primary language for RL development due to its rich ecosystem.
TensorFlow & Keras: Popular deep learning libraries often used for implementing neural network models in RL.
Stable Baselines3: A popular PyTorch-based RL library that provides implementations of widely used algorithms such as PPO, A2C, DDPG, and DQN.
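Example Code: Training a DQN (Deep Q-Network) agent using Stable-Baselines3
import gymnasium as gym  # Stable-Baselines3 2.x uses Gymnasium; older releases use `import gym`
from stable_baselines3 import DQN

env = gym.make('CartPole-v1')  # Create the CartPole environment
model = DQN('MlpPolicy', env, verbose=1)  # DQN agent with a multilayer-perceptron policy
model.learn(total_timesteps=10000)  # Train for 10,000 environment steps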
UNIT-II: Policy Optimization
1. Policy-Based Methods
Unlike value-based methods that approximate value functions (e.g., Q-Learning), policy-based methods directly optimize the
policy.
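Vanilla Policy Gradient (VPG): Optimizes the policy by following the gradient of the expected return J(θ) with respect to the
policy parameters θ.
Gradient Update Rule:
∇_θ J(θ) = E_{τ} [Σ_t ∇_θ log π_θ(a_t|s_t) ⋅ R(τ)]
Here τ is a trajectory sampled by running π_θ and R(τ) is its total reward. The parameters are updated by gradient ascent:
θ ← θ + α ∇_θ J(θ), where α is the learning rate.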
Code Example (REINFORCE Algorithm):
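import numpy as np

def reinforce(policy, gamma=0.99):
    # Sketch only: `policy` is assumed to store one finished episode in policy.states and
    # policy.rewards, and to provide get_action_probability() and update_weights().
    returns = []  # Discounted return G_t for each time step
    losses = []   # Per-step loss: -log π(a_t|s_t) * G_t
    G = 0.0
    for r in reversed(policy.rewards):
        G = r + gamma * G  # G_t = r_t + γ G_{t+1}
        returns.insert(0, G)
    for t in range(len(policy.states)):
        action_prob = policy.get_action_probability(policy.states[t])
        losses.append(-np.log(action_prob) * returns[t])
    for t in range(len(policy.states)):
        policy.update_weights(losses[t], returns[t])

policy = PolicyNetwork()  # Placeholder class; it is not defined in these notes
reinforce(policy)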
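For a version that runs end to end, the policy gradient can be applied to a two-armed bandit, where each episode is a single action so R(τ) is just the reward received. This is a minimal sketch, not a reference implementation: the softmax parameterization, the learning rate, the seed, and the reuse of the slot-machine probabilities 0.8 and 0.5 are all assumptions.

```python
import numpy as np


def softmax(theta):
    """Turn action preferences into probabilities pi_theta(a)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def grad_log_softmax(theta, action):
    """Gradient of log pi_theta(action) for a softmax policy: one_hot(action) - pi."""
    grad = -softmax(theta)
    grad[action] += 1.0
    return grad


def reinforce_bandit(win_probs, payout=10.0, lr=0.002, episodes=5000, seed=0):
    """REINFORCE on a bandit: theta <- theta + lr * grad log pi(a) * reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(len(win_probs))
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choice(len(probs), p=probs)  # Sample an action from the policy
        reward = payout if rng.random() < win_probs[action] else 0.0
        theta += lr * grad_log_softmax(theta, action) * reward  # Gradient ascent on J
    return softmax(theta)


probs = reinforce_bandit([0.8, 0.5])
print(probs)  # the probability of the better machine should end up above 0.5
```

Subtracting a baseline such as the average reward from `reward` is a common way to reduce the variance of this update.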
2. Actor-Critic Methods
Advantage Actor-Critic (A2C): Combines value-based and policy-based methods. The actor updates the policy, while the
critic evaluates the action using value functions.
Advantage Function:
A(s,a) = Q(s,a) - V(s)
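A short numerical check of the advantage function, using made-up Q-values and policy probabilities for a single state (both are assumptions for illustration). It also shows that the advantages average to zero under the policy, since V(s) is the policy-weighted mean of Q(s,a).

```python
import numpy as np

Q = np.array([2.0, 4.0, 6.0])     # Hypothetical Q(s, a) for three actions in one state
pi = np.array([0.5, 0.25, 0.25])  # Hypothetical policy pi(a|s)

V = float(pi @ Q)          # V(s) = sum_a pi(a|s) Q(s, a)
A = Q - V                  # A(s, a) = Q(s, a) - V(s)
mean_adv = float(pi @ A)   # Policy-weighted mean of the advantages

print(V)         # 3.5
print(A)         # negative for the first action, positive for the other two
print(mean_adv)  # 0.0
```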
Asynchronous Advantage Actor-Critic (A3C): Parallelizes training by running multiple worker agents, each in its own copy of
the environment, which asynchronously update a shared set of parameters.
3. Advanced Policy Gradient (PPO, TRPO, DDPG)
- Proximal Policy Optimization (PPO): A more stable variant of policy gradient that clips the objective function so that a
single update cannot move the new policy too far from the old one.
- Trust Region Policy Optimization (TRPO): Constrains each policy update by bounding the KL divergence between the old and
new policies, which improves stability.
- Deep Deterministic Policy Gradient (DDPG): An actor-critic method for continuous action spaces that learns a deterministic
policy together with a Q-function, combining ideas from Q-learning and policy gradients.
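The clipped objective mentioned for PPO can be written in a few lines. The sketch below computes the per-sample clipped surrogate min(r ⋅ A, clip(r, 1-ε, 1+ε) ⋅ A), where r = π_new(a|s) / π_old(a|s); the function name and the default ε = 0.2 are assumptions, and a full PPO implementation adds value and entropy terms that are not shown here.

```python
import numpy as np


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)


# Good action (A > 0) whose probability rose by 50%: the gain is capped at 1.2 * A
print(ppo_clipped_objective(1.5, 1.0))
# Bad action (A < 0) whose probability was halved: the objective uses the clipped term
print(ppo_clipped_objective(0.5, -1.0))
```

Because the objective stops improving once the ratio leaves the interval [1-ε, 1+ε], the optimizer has no incentive to push the policy further in a single update.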
UNIT-III: Recent Advances and Applications
1. Meta-Learning
Meta-learning focuses on improving the learning process itself, allowing an agent to learn how to adapt quickly to new tasks.
Example: In a robotics task, a robot can learn how to adapt to different terrains, such as sand or gravel, after seeing a few
examples.
2. Multi-Agent Reinforcement Learning (MARL)
Multiple agents interact with the same environment and learn either cooperatively or competitively.
Example: Self-driving cars coordinating with each other on a road to avoid collisions.
3. Partially Observable Markov Decision Process (POMDP)
In POMDPs, the agent cannot observe the complete state of the environment, so it must maintain a belief about the state
based on past actions and observations.
Example: A robot navigating through a smoky room where its sensors can’t fully detect the environment.
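Maintaining a belief can be sketched as a Bayes filter: b'(s') ∝ O(o|s',a) Σ_s P(s'|s,a) b(s). In the example below, the two hidden states, the single "sense" action that leaves the state unchanged, and the 85% sensor accuracy are all assumptions chosen for illustration.

```python
import numpy as np


def belief_update(belief, T, O, action, observation):
    """T[a, s, s'] = transition probabilities, O[a, s', o] = observation probabilities."""
    predicted = belief @ T[action]                         # Predict: sum_s P(s'|s,a) b(s)
    unnormalized = O[action][:, observation] * predicted   # Weight by observation likelihood
    return unnormalized / unnormalized.sum()               # Normalize to a distribution


# Hidden states: 0 = exit is on the left, 1 = exit is on the right
T = np.array([[[1.0, 0.0],
               [0.0, 1.0]]])      # Action 0 (sense) does not change the state
O = np.array([[[0.85, 0.15],
               [0.15, 0.85]]])    # The noisy sensor reports the true side 85% of the time

b0 = np.array([0.5, 0.5])         # Start with no idea where the exit is
b1 = belief_update(b0, T, O, action=0, observation=0)  # Sensor says "left"
b2 = belief_update(b1, T, O, action=0, observation=0)  # Sensor says "left" again
print(b1)  # belief shifts toward "left"
print(b2)  # a second consistent reading strengthens it further
```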
The Actor-Critic method is a combination of two key approaches in reinforcement learning: policy-based (Actor) and
value-based (Critic). Let’s break this down into simple terms, with a numerical example to make it easier to understand.
What is Actor-Critic?
In the Actor-Critic method:
The Actor is responsible for choosing actions. It learns the policy (strategy), which is like a guide for selecting the
best action in any given situation.
The Critic evaluates how good the action was by calculating the value of that action and state. It learns the value
function, which helps to guide the Actor.
These two parts work together:
The Critic tells the Actor how good or bad its actions are by giving feedback.
The Actor uses this feedback to improve its policy, so it picks better actions over time.
Key Concepts:
1. Actor: Learns a policy (probability distribution of actions). It suggests which action to take next.
2. Critic: Learns a value function that estimates how good a certain state or action is. It helps guide the Actor by
evaluating the choices made.
3. Advantage: The Critic measures how much better or worse an outcome was than it expected. This difference is called the
advantage, A(s,a) = Q(s,a) - V(s), and is often estimated by the TD error r + γ V(s') - V(s). The Actor uses this advantage to
adjust its policy.
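The interplay of the three concepts can be sketched as a single tabular update. This is one common one-step formulation under stated assumptions, not the only way to build an actor-critic: the softmax actor, the TD-error advantage estimate, the learning rates, and γ = 0.9 are all choices made for illustration.

```python
import numpy as np


def softmax(x):
    """Turn action preferences into probabilities."""
    z = np.exp(x - x.max())
    return z / z.sum()


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      gamma=0.9, actor_lr=0.1, critic_lr=0.1):
    """Apply one actor-critic update in place and return the TD error."""
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]         # Critic's feedback: outcome minus expectation
    V[s] += critic_lr * td_error     # Critic update: move V(s) toward the target
    grad_log_pi = -softmax(theta[s])
    grad_log_pi[a] += 1.0            # Gradient of log pi(a|s) for a softmax policy
    theta[s] += actor_lr * td_error * grad_log_pi  # Actor update
    return td_error


theta = np.zeros((4, 4))  # Actor: preferences for 4 states x 4 actions
V = np.zeros(4)           # Critic: one value estimate per state

# Hypothetical transition: in state 2, action 3 reaches a terminal state with reward +1
delta = actor_critic_step(theta, V, s=2, a=3, r=1.0, s_next=3, done=True)
print(delta)              # 1.0: the outcome was better than the critic expected
print(V[2])               # critic raises its estimate for state 2
print(softmax(theta[2]))  # actor now favors action 3 in state 2
```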
Now, let's go through a numerical example to make this clearer.
Example: Actor-Critic in a Simple Gridworld
Imagine a 2x2 grid where an agent (robot) starts at position (0, 0), and its goal is to reach (1, 1) to get a reward of +1. The
robot gets a small penalty of -0.01 for each move it makes to encourage faster decisions.
Step 1: Initial Setup
The robot can move up, down, left, or right.
The goal is to maximize the total reward, i.e., reach the goal in the fewest steps possible.
The robot starts with a random policy for picking actions and an estimate of how good each state is.
Initial Policy (Actor): For example, the robot initially has equal chances to move up, down, left, or right from any state:
π(up|s) = 0.25, π(down|s) = 0.25, π(left|s) = 0.25, π(right|s) = 0.25
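The setup so far can be written down as a small sketch. Several details are assumptions, because the notes do not fix them: rows are numbered top to bottom, a move that would leave the grid keeps the robot in place, and the step that reaches the goal earns +1 without the -0.01 penalty.

```python
# Action name -> (row change, column change); row 0 is assumed to be the top row
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GOAL = (1, 1)


def step(state, action):
    """Move within the 2x2 grid; a move off the grid leaves the position unchanged."""
    d_row, d_col = ACTIONS[action]
    row = min(1, max(0, state[0] + d_row))
    col = min(1, max(0, state[1] + d_col))
    next_state = (row, col)
    done = next_state == GOAL
    reward = 1.0 if done else -0.01  # +1 at the goal, small penalty for any other move
    return next_state, reward, done


# Uniform initial policy: pi(a|s) = 0.25 for every state and action
policy = {(r, c): {a: 0.25 for a in ACTIONS} for r in range(2) for c in range(2)}

# One shortest path from the start: right, then down
s1, r1, done1 = step((0, 0), "right")
s2, r2, done2 = step(s1, "down")
print(s2, r1 + r2, done2)  # reaches (1, 1) with total reward 0.99
```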