Reinforcement Learning Course Overview

syllabus

Uploaded by

pillipramod8096

R18 [Link].

CSE (AIML) III & IV Year JNTU Hyderabad

REINFORCEMENT LEARNING

[Link]. IV Year I Sem.
L T P C
2 0 0 2

Course Objectives: Knowledge of the fundamentals of reinforcement learning and of the methods used to
create agents that can solve a variety of complex tasks.

Course Outcomes
1. Understand basics of RL.
2. Understand RL Framework and Markov Decision Process.
3. Analyze learning through the use of Dynamic Programming and Monte Carlo methods.
4. Understand TD(0) algorithm, TD(λ) algorithm.

UNIT - I
Basics of probability and linear algebra, Definition of a stochastic multi-armed bandit, Definition of
regret, Achieving sublinear regret, UCB algorithm, KL-UCB, Thompson Sampling.

UNIT - II
Markov Decision Problem, policy, and value function, Reward models (infinite discounted, total, finite
horizon, and average), Episodic & continuing tasks, Bellman's optimality operator, Value iteration
& policy iteration.

UNIT - III
The Reinforcement Learning problem, prediction and control problems, Model-based algorithm, Monte
Carlo methods for prediction, and Online implementation of Monte Carlo policy evaluation.

UNIT - IV
Bootstrapping; TD(0) algorithm; Convergence of Monte Carlo and batch TD(0) algorithms; Model-free
control: Q-learning, Sarsa, Expected Sarsa.

UNIT - V
n-step returns; TD(λ) algorithm; Need for generalization in practice; Linear function approximation and
geometric view; Linear TD(λ). Tile coding; Control with function approximation; Policy search; Policy
gradient methods; Experience replay; Fitted Q Iteration; Case studies.

TEXT BOOKS:
1. “Reinforcement learning: An introduction,” First Edition, Sutton, Richard S., and Andrew G.
Barto, MIT press 2020.
2. “Statistical reinforcement learning: modern machine learning approaches,” First Edition,
Sugiyama, Masashi. CRC Press 2015.

REFERENCE BOOKS:
1. “Bandit algorithms,” First Edition, Lattimore, T. and C. Szepesvári. Cambridge University Press.
2020.
2. “Reinforcement Learning Algorithms: Analysis and Applications,” Boris Belousov, Hany
Abdulsamad, Pascal Klink, Simone Parisi, and Jan Peters First Edition, Springer 2021.
3. Alexander Zai and Brandon Brown “Deep Reinforcement Learning in Action,” First Edition,
Manning Publications 2020.

Common questions

The curriculum has two primary objectives: to build knowledge of the fundamentals of reinforcement learning, and to teach the methods used to create agents capable of solving a variety of complex tasks. This involves understanding reinforcement learning frameworks and algorithms and applying them to real-world problems.

Bellman's optimality operator maps a value function to a new one by taking, in each state, the maximum over actions of the expected immediate reward plus the discounted value of the next state. The optimal value function of an MDP is the fixed point of this operator, and an optimal policy acts greedily with respect to it. The operator serves as the foundation for algorithms such as value iteration and policy iteration, which iteratively update value estimates until they converge to the optimal value function.
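As a rough illustration of how repeated application of the optimality operator leads to value iteration, here is a minimal sketch. The two-state MDP, its rewards, and the names `P`, `GAMMA`, `bellman_optimality`, `value_iteration`, and `greedy_policy` are all made up for this example and do not come from the syllabus.

```python
# Hypothetical 2-state, 2-action MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed value)


def q_value(V, outcomes):
    """Expected one-step return: sum over outcomes of p * (r + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)


def bellman_optimality(V):
    """Apply the optimality operator once: (TV)(s) = max_a E[r + gamma * V(s')]."""
    return {
        s: max(q_value(V, outcomes) for outcomes in actions.values())
        for s, actions in P.items()
    }


def value_iteration(tol=1e-10, max_iters=10_000):
    """Iterate V <- TV until successive iterates differ by less than tol."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        V_new = bellman_optimality(V)
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new
    return V


def greedy_policy(V):
    """Pick, in each state, an action that attains the maximum in TV."""
    return {
        s: max(actions, key=lambda a: q_value(V, actions[a]))
        for s, actions in P.items()
    }


V_star = value_iteration()
pi_star = greedy_policy(V_star)
```

In this toy MDP, staying in state 1 with action 1 earns 2 per step, so the fixed point is V(1) = 2 / (1 - 0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19, with action 1 greedy in both states.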

Experience replay improves data efficiency and stabilizes training. The agent stores past transitions in a replay memory and learns from samples drawn from it, which breaks the temporal correlations between consecutive experiences, reduces the variance of updates, and lets each experience be reused many times, making learning more robust.
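A replay memory can be sketched in a few lines. The class below is an illustrative assumption (the name `ReplayBuffer`, its methods, and the transition layout are not taken from any particular library): a fixed-capacity buffer that drops the oldest transition when full and returns uniformly sampled mini-batches.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen discards the oldest item once capacity is reached.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; consecutive transitions are
        # unlikely to land in the same batch, which weakens temporal correlation.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
batch = buffer.sample(2)  # two of the three most recent transitions
```

Uniform sampling is the simplest choice; prioritized variants exist but are outside what the syllabus lists.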

In the context of stochastic multi-armed bandits, 'regret' is the difference between the reward that could have been obtained by always choosing the best arm and the reward actually obtained by the strategy being followed, accumulated over time and usually taken in expectation. Algorithms such as Upper Confidence Bound (UCB), KL-UCB, and Thompson Sampling balance exploration and exploitation so that regret grows sublinearly in the number of rounds.
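To make regret and the exploration bonus concrete, the following sketch runs the classic UCB1 index rule (empirical mean plus sqrt(2 ln t / n)) on a simulated Bernoulli bandit. The function name `ucb1`, the arm means, and the horizon are assumptions chosen for the example, and the regret tracked here is the pseudo-regret computed from the true means, which a real agent would not know.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given (hidden) success probabilities.

    Returns the pull counts per arm and the cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k    # number of pulls of each arm
    sums = [0.0] * k    # total reward collected from each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]  # gap between best arm and chosen arm
    return counts, regret


counts, regret = ucb1([0.2, 0.8], horizon=5000)
```

Pulling arms uniformly at random would incur regret of about 0.3 per round here (1500 over 5000 rounds); the index rule should concentrate its pulls on the better arm and stay well below that.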

Monte Carlo policy evaluation estimates value functions directly from sampled episodes, without requiring a model of the environment. This makes it well suited to environments where a model is difficult to obtain. In an online implementation, the value estimate is updated incrementally after each episode from the observed returns, rather than recomputed from a stored batch of episodes.
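One way to picture the online form is a running-average update applied after each episode. The sketch below implements first-visit Monte Carlo prediction under assumed conventions: an episode is a list of (state, reward) pairs where the reward is received on leaving the state, and the function name `mc_update` is hypothetical.

```python
def mc_update(V, N, episode, gamma=1.0):
    """First-visit Monte Carlo update of V from one finished episode.

    V maps state -> value estimate, N maps state -> number of first visits.
    """
    # Compute the return G_t for every time step by scanning backwards.
    G = 0.0
    returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    returns.reverse()  # back to chronological order

    seen = set()
    for state, G in returns:
        if state in seen:
            continue  # first-visit: only the earliest occurrence counts
        seen.add(state)
        N[state] = N.get(state, 0) + 1
        old = V.get(state, 0.0)
        # Incremental mean: V <- V + (G - V) / N, no stored returns needed.
        V[state] = old + (G - old) / N[state]


V, N = {}, {}
mc_update(V, N, [("A", 0.0), ("B", 1.0)])  # return 1 from both A and B
mc_update(V, N, [("A", 0.0), ("B", 0.0)])  # return 0 from both A and B
```

After these two episodes both estimates equal the average of the observed returns, 0.5.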

Monte Carlo methods and TD(0) are both used for prediction in reinforcement learning, but they differ in how they update value estimates. Monte Carlo methods wait for a complete episode and update towards the observed return, averaging over many episodes. TD(0) updates after every step using bootstrapping: its target combines the immediate reward with the discounted current estimate of the next state's value.
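The TD(0) step can be written as V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)). The snippet below is a small sketch of that rule on a hypothetical two-step episode A -> B -> terminal; the function name `td0_update` and the step size are assumptions for illustration.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.5, gamma=1.0):
    """Apply one TD(0) update to V[state] and return the new estimate."""
    # Bootstrapped target: immediate reward plus discounted estimate of the
    # next state's value (terminal states contribute no future value).
    target = reward if done else reward + gamma * V.get(next_state, 0.0)
    old = V.get(state, 0.0)
    V[state] = old + alpha * (target - old)
    return V[state]


V = {}
for _ in range(2):  # replay the same episode twice, updating at every step
    td0_update(V, "A", 0.0, "B", done=False)
    td0_update(V, "B", 1.0, None, done=True)
```

After the first pass V(A) is still 0 because V(B) was 0 when A was updated; only on the second pass does the reward propagate back, giving V(A) = 0.25 and V(B) = 0.75. A Monte Carlo update would instead have moved V(A) towards the full return of 1 after the first episode.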

Policy gradient methods differ from value-based methods by optimizing the policy directly instead of deriving it from estimated value functions. Value-based methods, such as Q-learning, aim to learn the optimal action-value function and act greedily with respect to it. Policy gradient methods adjust the parameters of a parameterized policy along the gradient of expected reward, which makes them applicable to large or continuous action spaces.
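A stripped-down view of the idea: for a softmax policy over the arms of a bandit with known rewards, the gradient of the expected reward J with respect to preference theta_a is pi(a) * (r(a) - J), so plain gradient ascent on the preferences shifts probability towards the better arm. The sketch below uses this exact gradient to stay deterministic; sample-based methods such as REINFORCE estimate the same quantity from experience. The function names and the reward values are assumptions for illustration.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_ascent(rewards, lr=1.0, steps=200):
    """Gradient ascent on J(theta) = sum_a pi_theta(a) * r(a)."""
    theta = [0.0] * len(rewards)  # start from the uniform policy
    for _ in range(steps):
        pi = softmax(theta)
        J = sum(p * r for p, r in zip(pi, rewards))
        # Exact gradient for a softmax policy: dJ/dtheta_a = pi(a) * (r(a) - J).
        grad = [p * (r - J) for p, r in zip(pi, rewards)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


pi = policy_gradient_ascent([0.0, 1.0])  # arm 1 pays more
```

No value function is stored at any point; only the policy parameters change.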

The Markov Decision Process (MDP) framework contributes to solving reinforcement learning problems by providing a formal model of the environment with which an agent interacts. An MDP is characterized by states, actions, rewards, and transition probabilities, usually together with a discount factor. A policy, which dictates the agent's action in each state, is what the agent chooses. The goal is to find an optimal policy that maximizes the cumulative reward over time.

Generalization is crucial in reinforcement learning for handling large or continuous state spaces, where learning a separate value or action for every state is not feasible. It is achieved through function approximation techniques such as linear function approximation, tile coding, and neural networks, which share parameters across similar states so that experience in one state improves estimates in nearby ones.
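Tile coding shows this sharing quite directly. The sketch below is a simplified one-dimensional version under assumed parameters (four tilings of width 0.25 over [0, 1], each shifted by a quarter of a tile); the function names are hypothetical. A state activates one tile per tiling, the value estimate is the sum of the active weights, and an update at one state changes the estimate at neighbours that share tiles with it.

```python
N_TILINGS = 4         # number of overlapping grids (assumed)
TILES_PER_TILING = 4  # tiles covering [0, 1] in the unshifted grid (assumed)
WIDTH = 1.0 / TILES_PER_TILING


def active_tiles(x):
    """Return one active weight index per tiling for a state x in [0, 1]."""
    indices = []
    for k in range(N_TILINGS):
        offset = k * WIDTH / N_TILINGS  # each tiling is shifted a little further
        tile = min(int((x + offset) / WIDTH), TILES_PER_TILING)
        # Each tiling owns TILES_PER_TILING + 1 weights so that the shifted
        # grid still covers the right edge of the interval.
        indices.append(k * (TILES_PER_TILING + 1) + tile)
    return indices


def value(w, x):
    """Linear value estimate: sum of the weights of the active tiles."""
    return sum(w[i] for i in active_tiles(x))


def update(w, x, target, alpha=1.0):
    """Move value(w, x) towards target, splitting the step across tilings."""
    error = target - value(w, x)
    for i in active_tiles(x):
        w[i] += (alpha / N_TILINGS) * error


w = [0.0] * (N_TILINGS * (TILES_PER_TILING + 1))
update(w, 0.2, target=1.0)
near, far = value(w, 0.3), value(w, 0.9)
```

After a single update at x = 0.2, the estimate at x = 0.3 rises to 0.75 because three of its four tiles are shared, while the estimate at x = 0.9 stays at 0.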

Function approximation introduces challenges such as instability and divergence in learning algorithms. These issues are addressed through techniques including experience replay, which stabilizes learning by drawing updates from previously seen experiences; target networks, which hold the bootstrapping target fixed between periodic updates; and regularization of function approximators to prevent overfitting.
