0% found this document useful (0 votes)

22 views33 pages

Reinforcement Learning Techniques Overview

Uploaded by

Akhila Akula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

22 views33 pages

Reinforcement Learning Techniques Overview

Uploaded by

Akhila Akula

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

UNIT- V

Passive reinforcement learning – direct utility estimation –

adaptive dynamic programming - temporal- difference learning-
active reinforcement learning- exploration- learning an action
utility function-Generalization in reinforcement learning – policy
search- applications in game playing- applications in robot
control.
Reinforcement learning is a machine learning
training method based on rewarding desired
behaviors and punishing undesired ones.
In general, a reinforcement learning agent -- the
entity being trained -- is able to perceive and
interpret its environment, take actions and learn
through trial and error.
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent
agents ought to take actions in an environment in order to maximize the notion
of cumulative reward. Reinforcement learning is one of three basic machine learning
paradigms, alongside supervised learning and unsupervised learning.
Reinforcement learning differs from supervised learning in not needing labelled
input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly
corrected. Instead the focus is on finding a balance between exploration (of uncharted
territory) and exploitation (of current knowledge).
The environment is typically stated in the form of a Markov decision process (MDP),
because many reinforcement learning algorithms for this context use dynamic
programming [Link] main difference between the classical dynamic programming
methods and reinforcement learning algorithms is that the latter do not assume knowledge
of an exact mathematical model of the MDP and they target large MDPs where exact
methods become infeasible.
The basic elements of an RL problem are:
[Link] — Physical world in which the agent
operates
[Link] — Current situation of the agent
[Link] — Feedback from the environment
[Link] — Method to map agent’s state to actions
[Link] — Future reward that an agent would receive
by taking an action in a particular state
APPLICATIONS OF RL

Reinforcement Learning in news recommendation

Reinforcement Learning in gaming
Real-time bidding— Reinforcement Learning applications in marketing and advertising
Reinforcement Learning in robotics manipulation
Reinforcement Learning in NLP (Natural Language Processing)
Rewards
● Agent takes actions
● Agent occasionally receives reward
● Maybe just at the end of the process,
e.g., Chess: – agent has to decide on individual moves – reward only at end: win/lose
● Maybe more frequently
– Scrabble: points for each word played
– ping pong: any point scored
– baby learning to crawl: any forward movement
An MDP is the mathematical framework which captures such a fully
observable, non-deterministic environment with Markovian
Transition Model and additive rewards in which the agent acts. The
solution to an MDP is an optimal policy which refers to the choice of
action for every state that maximizes overall cumulative reward.
Thus, the transition model that represents an agent’s
environment(when the environment is known) and the optimal
policy which decides what action the agent needs to perform in each
Agent Designs

● Utility based agent

– needs model of environment

– learns utility function on states
– selects action that maximize expected outcome utility

● Q-learning – learns action-utility function (Q(s, a) function)

– does not need to model outcomes of actions

– function provides expected utility of taken a given action at a given step

● Reflex agent – learns policy that maps states to actions

PASSIVE LEARNING
As the goal of the agent is to evaluate how good an optimal policy is, the
agent needs to learn the expected utility Uπ(s) for each state s. This can
be done in three ways.
Direct Utility Estimation
In this method, the agent executes a sequence of trials or
runs (sequences of states-actions transitions that continue until the
agent reaches the terminal state). Each trial gives a sample value and the
agent estimates the utility based on the samples values. Can be
calculated as running averages of sample values. The main drawback
is that this method makes a wrong assumption that state utilities are
independent while in reality they are Markovian. Also, it is slow to
converge.
Suppose we have a 4x3 grid as the environment in which the agent can
move either Left, Right, Up or Down(set of available actions). An example
of a run
Total reward starting at (1,1) = 0.72
PASSIVE
LEARNING

We know which state we are in (= partially observable environment)

● We know which actions we can take
● But only after taking an action
→ new state becomes known
→ reward becomes known
Given a policy

● Task: compute utility of policy

● We will extend this later to active reinforcement learning (⇒ policy needs to be learned)
Utility of Policy
● Definition of utility U of the policy π for state s

U π (s) = E [ ∞ ∑ t=0 γ tR(St)]

Start at state S0 = s
● Reward for state is R(s)
● Discount factor γ (we use γ = 1 in our examples)

Direct Utility Estimation:

Learning from the samples
● Reward to go:

– (1,1) one sample: 0.72

– (1,2) two samples: 0.76, 0.84
– (1,3) two samples: 0.80, 0.88

● Reward to go will converge to utility of state

● But very slowly — can we do better?
1. Bellman Equation

● Direct utility estimation ignores dependency between states

● Given by Bellman equation

U π (s) = R(s) + γ ∑ s ′ P(s ′ ∣s, π(s)) U π(s ′) (γ = reward decay)

● Requires learning of transition probabilities P(s ′ ∣s, π(s))

● Use of this known dependence can speed up learning
1. Adaptive Dynamic Programming

Need to learn:

● State rewards R(s)

– whenever a state is visited, record award (deterministic)

● Outcome of action π(s) at state s according to policy π

– collect statistic count(s, s′) that s ′ is reached from s

– estimate probability distribution

P(s ′ ∣s, π(s)) = count(s, s′) / ∑s ′′ count(s, s′′)

⇒ Ingredients for policy evaluation algorithm

LEARNING CURVE
Idea: do not model P(s ′ ∣s, π(s)), directly adjust utilities U(s) for all visited states
1. Temporal Difference Learning
●
● Estimate of current utility: Uπ(s)
● Estimate of utility after action: R(s) + γUπ(s ′)
● Adjust utility of current state Uπ(s) if they differ

∆U π (s) = α ( R(s) + γUπ (s ′ ) − U π (s)) (α = learning rate)

● Learning rate may decrease when state has been visited often
LEARNING CURVE
Comparison

● Both eventually converge to correct values

● Adaptive dynamic programming (ADP) faster than temporal difference learning (TD)

– both make adjustments to make successors agree

– but: ADP adjusts all possible successors, TD only observed successor

● ADP computationally more expensive due to policy evaluation algorithm

Active Reinforcement Learning

● Previously: passive agent follows prescribed policy

● Now: active agent decides which action to take

– following optimal policy (as currently viewed)

– exploration

● Goal: optimize rewards for a given time frame

Greedy Agent

Start with initial policy

Compute utilities (using ADP)
Optimize policy
Go to Step 2
This very seldom converges to global optimal policy
Greedy in the Limit of Infinite Exploration

● Explore any action in any state unbounded number of times

– carry out optimal policy ⇒ maximize reward

Eventually has to become greedy

Simple strategy with probability p(1/t) take random action

– initially (t small) focus on exploration
– later (t big) focus on optimal policy

R+ is optimistic estimate, best possible award in any state

Extension of Adaptive Dynamic Programming

U(s) ← R(s) + γ maxa ∑ s ′ P(s ′ ∣s, a) U(s ′ )

● Previous definition of utility calculation

U + (s) ← R(s) + γ maxa f (∑ s ′ P(s ′ ∣s, a) U + (s ′ ), N(s, a))

● New utility calculation

● One possible definition of f(u, n)

f(u, n) = { R+
if n <Nc
u otherwise
LEARNING CURVE
Q-Learning
● Learning an action utility function Q(s, a)

Model-free: no explicit transition model P(s ′ ∣s, a)

● Allows computation of utilities U(s) = maxaQ(s, a)
●
● Theoretically correct Q values

Q(s, a) = R(s) + γ ∑ s ′ P(s ′ ∣s, a) maxa′Q(s ′ , a′ )

● Update formula inspired by temporal difference learning

∆Q(s, a) = α(R(s) + γ maxa′ Q(s ′ , a′ ) − Q(s, a))

● For our example, Q-learning slower, but successful applications (TD-GAMMON)

Generalization in reinforcement learning
Large Scale Reinforcement Learning

● Adaptive dynamic programming (ASP) scalable to maybe 10,000 states

– Backgammon has 1020 states
– Chess has 1040 states

● It is not possible to visit all these states multiple times ⇒ Generalization of states
needed
Function Approximation
● Define state utility function as linear combination of features Uˆ θ(s) =
θ1 f1(s) + θ2 f2(s) + ... + θnfn(s)
● Recall: features to assess Chess state
– f1(s) = (number of white pawns) – (number of black pawns)
– f2(s) = (number of white rooks) – (number of black rooks)
– f3(s) = (number of white queens) – (number of black queens)
– f4(s) = king safety
– f5(s) = good pawn position – etc.

⇒ Reduction from 1040 to, say, 20 parameters

Main benefit: ability to assess unseen states

Learning Feature Weights
● Example: 2 features: x and y
Uˆ θ(f1, f2) = θ0 + θ1f1 + θ2f2
● Current feature weights θ0, θ1, θ2 = (0.5, 0.2, 0.1)
● Model’s prediction of utility of specific state, e.g., Uˆ θ(1, 1) = 0.8
● Sample set of trials, found value uθ(1, 1) = 0.4
● Error Eθ = 1 2 (Uˆ θ(f1, f2) − uθ(f1, f2))2
● How do you update the weights θi?
Gradient Descent Training
Compute gradient of error
Update against gradient
dEθdθi = (Uˆ θ(f1, f2) − uθ(f1, f2)) fi

∆θi = −µ dEθdθi

● Our example
– ∆θ1 = −µ(Uˆ θ(f1, f2) − uθ(f1, f2)) fi = −µ(0.8 − 0.4) 1 = −0.4µ
– ∆θ2 = −µ(Uˆ θ(f1, f2) − uθ(f1, f2)) fi = −µ(0.8 − 0.4) 1 = −0.4µ

[Link] Remarks

⇒we may want to use more complex features

● If we know something about the problem

● Our toy example: utility related to Manhattan distance from goal (xgoal, ygoal)
f3(s) = (x − xgoal) + (y − ygoal)
● Gradient descent training can also be used for temporal distance learning
Policy Search
● Idea: directly optimize policy
● Policy may be parameterized Q functions, hence:
π(s) = argmaxaQˆ θ(s, a)
● Stochastic policy, e.g., given by softmax function
πθ(s, a) = 1 Zs e Qˆ θ (s,a)
● Policy value ρ(θ): expected reward if πθ is carried out

Hill climbing

⇒ optimizing policy value ρ(θ) may be done in closed form

● Deterministic policy, deterministic environment

⇒gradient descent by following policy gradient

● If ρ(θ) differentiable

⇒hillclimb if ρ(θ) improves

● Make small changes to parameters

More complex for stochastic environment

Summary
● Building on Markov decision processes and machine learning
● Passive reinforcement learning (fixed policy, partially observable
environment, stochastic outcomes of actions)
– sampling (carrying out trials)
– adaptive dynamic programming
– temporal difference learning

● Active reinforcement learning

– greedy in the limit of infinite exploration
– following optimal policy vs. exploration
– exploratory adaptive dynamic programming

● Generalization: representing utility function with small set of parameters

● Policy search

Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
20 pages
Passive Reinforcement Learning Overview
No ratings yet
Passive Reinforcement Learning Overview
43 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
7 pages
Reinforcement Learning Explained
No ratings yet
Reinforcement Learning Explained
45 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
22 pages
Direct Utility Estimation in Reinforcement Learning
No ratings yet
Direct Utility Estimation in Reinforcement Learning
32 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
22 pages
Ai Unit-5 PDF
No ratings yet
Ai Unit-5 PDF
33 pages
Passive vs Active Reinforcement Learning
No ratings yet
Passive vs Active Reinforcement Learning
30 pages
Passive vs Active Reinforcement Learning
No ratings yet
Passive vs Active Reinforcement Learning
5 pages
Reinforcement Learning & NLP Overview
No ratings yet
Reinforcement Learning & NLP Overview
22 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
34 pages
Reinforcement Learning Fundamentals
No ratings yet
Reinforcement Learning Fundamentals
58 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
59 pages
Reinforcement Learning: TD and ADP Methods
No ratings yet
Reinforcement Learning: TD and ADP Methods
16 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
61 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
30 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
30 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
16 pages
09 RL
No ratings yet
09 RL
58 pages
Reinforcement Learning Overview Guide
No ratings yet
Reinforcement Learning Overview Guide
12 pages
Unit 5 Ai ..
No ratings yet
Unit 5 Ai ..
9 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
18 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
34 pages
Reinforcement Learning Reinforcement Learning
No ratings yet
Reinforcement Learning Reinforcement Learning
24 pages
Unit-IV Passive Reinforcement Learning
No ratings yet
Unit-IV Passive Reinforcement Learning
6 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
16 pages
Reinforcement Learning Overview
No ratings yet
Reinforcement Learning Overview
13 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
23 pages
Reinforcement Learning Overview
No ratings yet
Reinforcement Learning Overview
47 pages
UC Berkeley Reinforcement Learning Guide
No ratings yet
UC Berkeley Reinforcement Learning Guide
46 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
4 pages
Basics of Reinforcement Learning
No ratings yet
Basics of Reinforcement Learning
15 pages
Reinforcement Learning Crash Course
No ratings yet
Reinforcement Learning Crash Course
84 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
57 pages
06 Reinforcement Learning
No ratings yet
06 Reinforcement Learning
12 pages
Reinforcement Learning Overview and Methods
No ratings yet
Reinforcement Learning Overview and Methods
23 pages
Independent Q-Learning in RL
No ratings yet
Independent Q-Learning in RL
50 pages
Advanced Reinforcement Learning Course
No ratings yet
Advanced Reinforcement Learning Course
33 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
18 pages
MonteCarlo RL
No ratings yet
MonteCarlo RL
50 pages
Module1 A
No ratings yet
Module1 A
20 pages
Reinforcement Learning Concepts Explained
No ratings yet
Reinforcement Learning Concepts Explained
45 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
8 pages
Markov Decision Processes in Q-Learning
No ratings yet
Markov Decision Processes in Q-Learning
23 pages
09 Rlie
No ratings yet
09 Rlie
71 pages
Reinforcement Learning for Self-Driving Cars
No ratings yet
Reinforcement Learning for Self-Driving Cars
57 pages
Reinforcement Learning for Robot Walking
No ratings yet
Reinforcement Learning for Robot Walking
9 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
52 pages
Reinforcement Learning and MDP Explained
No ratings yet
Reinforcement Learning and MDP Explained
11 pages
Overview of Reinforcement Learning
No ratings yet
Overview of Reinforcement Learning
45 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
8 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
26 pages
Reinforcement Learning Cheatsheet
No ratings yet
Reinforcement Learning Cheatsheet
16 pages
Reinforcement Learning Overview
No ratings yet
Reinforcement Learning Overview
57 pages
UNIT-II Lecture Notes 4-2 CS-1
No ratings yet
UNIT-II Lecture Notes 4-2 CS-1
44 pages
Understanding Tree Models in Machine Learning
No ratings yet
Understanding Tree Models in Machine Learning
16 pages
Linear Models and Regression Techniques
No ratings yet
Linear Models and Regression Techniques
92 pages
Tree and Rule Models in Machine Learning
No ratings yet
Tree and Rule Models in Machine Learning
25 pages
Multi-Phased Attack Strategies Explained
100% (1)
Multi-Phased Attack Strategies Explained
14 pages
B.Tech Computer Organization Exam Paper
No ratings yet
B.Tech Computer Organization Exam Paper
2 pages
Exam Questions on Computer Systems
No ratings yet
Exam Questions on Computer Systems
3 pages
Genogram in Personal Development Lesson
No ratings yet
Genogram in Personal Development Lesson
6 pages
Understanding Verb Derivation in English
100% (1)
Understanding Verb Derivation in English
9 pages
Diction Software For Content Analysis
100% (1)
Diction Software For Content Analysis
27 pages
Understanding Cell Theory Development
No ratings yet
Understanding Cell Theory Development
3 pages
BSOG-173 Rethinking Development Assignments
No ratings yet
BSOG-173 Rethinking Development Assignments
4 pages
Second Language Acquisition Overview
No ratings yet
Second Language Acquisition Overview
8 pages
Importance of Quantitative Research Across Fields
No ratings yet
Importance of Quantitative Research Across Fields
20 pages
Inclusive Education: UDL for ADHD Students
No ratings yet
Inclusive Education: UDL for ADHD Students
11 pages
Hướng Dẫn Chấm Thi Tiếng Anh Chuyên
100% (1)
Hướng Dẫn Chấm Thi Tiếng Anh Chuyên
8 pages
Transition to Outcome-Based Education
No ratings yet
Transition to Outcome-Based Education
9 pages
Ladder of Feedback Protocol Overview
No ratings yet
Ladder of Feedback Protocol Overview
1 page
The Feminization of Human Nature
80% (10)
The Feminization of Human Nature
18 pages
Blended Learning: Rhyming Lesson Plan
No ratings yet
Blended Learning: Rhyming Lesson Plan
4 pages
Diagnostic Test
No ratings yet
Diagnostic Test
10 pages
Evolution of Information Design
71% (7)
Evolution of Information Design
42 pages
Understanding Multimodal Texts
No ratings yet
Understanding Multimodal Texts
4 pages
Grade 4 English Diagnostic Test 2021
100% (1)
Grade 4 English Diagnostic Test 2021
3 pages
Understanding Grammar with Graciela
No ratings yet
Understanding Grammar with Graciela
2 pages
IPAQ Short Form Questionnaire Overview
100% (1)
IPAQ Short Form Questionnaire Overview
4 pages
Survey of Attention Mechanisms in CV
No ratings yet
Survey of Attention Mechanisms in CV
38 pages
Impact of English Speaking Policy at SJTIT
91% (32)
Impact of English Speaking Policy at SJTIT
105 pages
Games, Puzzles To Keep Your Mind Sharp
No ratings yet
Games, Puzzles To Keep Your Mind Sharp
724 pages
Reinforcement Learning Fundamentals
No ratings yet
Reinforcement Learning Fundamentals
30 pages
Data Structures & Algorithms Guide
No ratings yet
Data Structures & Algorithms Guide
42 pages
Understanding Data Objects & Attributes
No ratings yet
Understanding Data Objects & Attributes
10 pages
Content-Based Instruction Overview
No ratings yet
Content-Based Instruction Overview
25 pages
Teacher Feedback's Impact on Student Motivation
No ratings yet
Teacher Feedback's Impact on Student Motivation
7 pages
The Logic of What Might Have Actually Be
No ratings yet
The Logic of What Might Have Actually Be
5 pages
Rāmānuja and Śaṅkara on Vedānta Categories
No ratings yet
Rāmānuja and Śaṅkara on Vedānta Categories
16 pages
Normative Data for Victoria Stroop Test
No ratings yet
Normative Data for Victoria Stroop Test
17 pages

Reinforcement Learning Techniques Overview

Uploaded by

Reinforcement Learning Techniques Overview

Uploaded by

UNIT- V

Passive reinforcement learning – direct utility estimation –

Reinforcement Learning in news recommendation

● Utility based agent

– needs model of environment

● Q-learning – learns action-utility function (Q(s, a) function)

– does not need to model outcomes of actions

● Reflex agent – learns policy that maps states to actions

We know which state we are in (= partially observable environment)

● Task: compute utility of policy

U π (s) = E [ ∞ ∑ t=0 γ tR(St)]

Direct Utility Estimation:

– (1,1) one sample: 0.72

● Reward to go will converge to utility of state

● Direct utility estimation ignores dependency between states

U π (s) = R(s) + γ ∑ s ′ P(s ′ ∣s, π(s)) U π(s ′) (γ = reward decay)

● Requires learning of transition probabilities P(s ′ ∣s, π(s))

● State rewards R(s)

● Outcome of action π(s) at state s according to policy π

– collect statistic count(s, s′) that s ′ is reached from s

P(s ′ ∣s, π(s)) = count(s, s′) / ∑s ′′ count(s, s′′)

⇒ Ingredients for policy evaluation algorithm

∆U π (s) = α ( R(s) + γUπ (s ′ ) − U π (s)) (α = learning rate)

● Both eventually converge to correct values

– both make adjustments to make successors agree

– but: ADP adjusts all possible successors, TD only observed successor

● ADP computationally more expensive due to policy evaluation algorithm

● Previously: passive agent follows prescribed policy

● Now: active agent decides which action to take

– following optimal policy (as currently viewed)

● Goal: optimize rewards for a given time frame

Start with initial policy

● Explore any action in any state unbounded number of times

– carry out optimal policy ⇒ maximize reward

Simple strategy with probability p(1/t) take random action

R+ is optimistic estimate, best possible award in any state

U(s) ← R(s) + γ maxa ∑ s ′ P(s ′ ∣s, a) U(s ′ )

U + (s) ← R(s) + γ maxa f (∑ s ′ P(s ′ ∣s, a) U + (s ′ ), N(s, a))

● One possible definition of f(u, n)

Model-free: no explicit transition model P(s ′ ∣s, a)

Q(s, a) = R(s) + γ ∑ s ′ P(s ′ ∣s, a) maxa′Q(s ′ , a′ )

● Update formula inspired by temporal difference learning

● For our example, Q-learning slower, but successful applications (TD-GAMMON)

● Adaptive dynamic programming (ASP) scalable to maybe 10,000 states

⇒ Reduction from 1040 to, say, 20 parameters

Main benefit: ability to assess unseen states

⇒we may want to use more complex features

⇒ optimizing policy value ρ(θ) may be done in closed form

⇒gradient descent by following policy gradient

⇒hillclimb if ρ(θ) improves

More complex for stochastic environment

● Active reinforcement learning

● Generalization: representing utility function with small set of parameters

You might also like