0% found this document useful (0 votes)
22 views33 pages

Reinforcement Learning Techniques Overview

Uploaded by

Akhila Akula
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views33 pages

Reinforcement Learning Techniques Overview

Uploaded by

Akhila Akula
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

UNIT- V

Passive reinforcement learning – direct utility estimation –


adaptive dynamic programming - temporal- difference learning-
active reinforcement learning- exploration- learning an action
utility function-Generalization in reinforcement learning – policy
search- applications in game playing- applications in robot
control.
Reinforcement learning is a machine learning
training method based on rewarding desired
behaviors and punishing undesired ones.
In general, a reinforcement learning agent -- the
entity being trained -- is able to perceive and
interpret its environment, take actions and learn
through trial and error.
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent
agents ought to take actions in an environment in order to maximize the notion
of cumulative reward. Reinforcement learning is one of three basic machine learning
paradigms, alongside supervised learning and unsupervised learning.
Reinforcement learning differs from supervised learning in not needing labelled
input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly
corrected. Instead the focus is on finding a balance between exploration (of uncharted
territory) and exploitation (of current knowledge).
The environment is typically stated in the form of a Markov decision process (MDP),
because many reinforcement learning algorithms for this context use dynamic
programming [Link] main difference between the classical dynamic programming
methods and reinforcement learning algorithms is that the latter do not assume knowledge
of an exact mathematical model of the MDP and they target large MDPs where exact
methods become infeasible.
The basic elements of an RL problem are:
[Link] — Physical world in which the agent
operates
[Link] — Current situation of the agent
[Link] — Feedback from the environment
[Link] — Method to map agent’s state to actions
[Link] — Future reward that an agent would receive
by taking an action in a particular state
APPLICATIONS OF RL

Reinforcement Learning in news recommendation


Reinforcement Learning in gaming
Real-time bidding— Reinforcement Learning applications in marketing and advertising
Reinforcement Learning in robotics manipulation
Reinforcement Learning in NLP (Natural Language Processing)
Rewards
● Agent takes actions
● Agent occasionally receives reward
● Maybe just at the end of the process,
e.g., Chess: – agent has to decide on individual moves – reward only at end: win/lose
● Maybe more frequently
– Scrabble: points for each word played
– ping pong: any point scored
– baby learning to crawl: any forward movement
An MDP is the mathematical framework which captures such a fully
observable, non-deterministic environment with Markovian
Transition Model and additive rewards in which the agent acts. The
solution to an MDP is an optimal policy which refers to the choice of
action for every state that maximizes overall cumulative reward.
Thus, the transition model that represents an agent’s
environment(when the environment is known) and the optimal
policy which decides what action the agent needs to perform in each
Agent Designs

● Utility based agent

– needs model of environment


– learns utility function on states
– selects action that maximize expected outcome utility

● Q-learning – learns action-utility function (Q(s, a) function)

– does not need to model outcomes of actions


– function provides expected utility of taken a given action at a given step

● Reflex agent – learns policy that maps states to actions


PASSIVE LEARNING
As the goal of the agent is to evaluate how good an optimal policy is, the
agent needs to learn the expected utility Uπ(s) for each state s. This can
be done in three ways.
Direct Utility Estimation
In this method, the agent executes a sequence of trials or
runs (sequences of states-actions transitions that continue until the
agent reaches the terminal state). Each trial gives a sample value and the
agent estimates the utility based on the samples values. Can be
calculated as running averages of sample values. The main drawback
is that this method makes a wrong assumption that state utilities are
independent while in reality they are Markovian. Also, it is slow to
converge.
Suppose we have a 4x3 grid as the environment in which the agent can
move either Left, Right, Up or Down(set of available actions). An example
of a run
Total reward starting at (1,1) = 0.72
PASSIVE
LEARNING

We know which state we are in (= partially observable environment)


● We know which actions we can take
● But only after taking an action
→ new state becomes known
→ reward becomes known
Given a policy

● Task: compute utility of policy


● We will extend this later to active reinforcement learning (⇒ policy needs to be learned)
Utility of Policy
● Definition of utility U of the policy π for state s

U π (s) = E [ ∞ ∑ t=0 γ tR(St)]


Start at state S0 = s
● Reward for state is R(s)
● Discount factor γ (we use γ = 1 in our examples)

Direct Utility Estimation:


Learning from the samples
● Reward to go:

– (1,1) one sample: 0.72


– (1,2) two samples: 0.76, 0.84
– (1,3) two samples: 0.80, 0.88

● Reward to go will converge to utility of state


● But very slowly — can we do better?
1. Bellman Equation

● Direct utility estimation ignores dependency between states


● Given by Bellman equation

U π (s) = R(s) + γ ∑ s ′ P(s ′ ∣s, π(s)) U π(s ′) (γ = reward decay)

● Requires learning of transition probabilities P(s ′ ∣s, π(s))


● Use of this known dependence can speed up learning
1. Adaptive Dynamic Programming

Need to learn:

● State rewards R(s)


– whenever a state is visited, record award (deterministic)

● Outcome of action π(s) at state s according to policy π

– collect statistic count(s, s′) that s ′ is reached from s


– estimate probability distribution

P(s ′ ∣s, π(s)) = count(s, s′) / ∑s ′′ count(s, s′′)

⇒ Ingredients for policy evaluation algorithm


LEARNING CURVE
Idea: do not model P(s ′ ∣s, π(s)), directly adjust utilities U(s) for all visited states
1. Temporal Difference Learning

● Estimate of current utility: Uπ(s)
● Estimate of utility after action: R(s) + γUπ(s ′)
● Adjust utility of current state Uπ(s) if they differ

∆U π (s) = α ( R(s) + γUπ (s ′ ) − U π (s)) (α = learning rate)


● Learning rate may decrease when state has been visited often
LEARNING CURVE
Comparison

● Both eventually converge to correct values

● Adaptive dynamic programming (ADP) faster than temporal difference learning (TD)

– both make adjustments to make successors agree

– but: ADP adjusts all possible successors, TD only observed successor

● ADP computationally more expensive due to policy evaluation algorithm


Active Reinforcement Learning

● Previously: passive agent follows prescribed policy

● Now: active agent decides which action to take

– following optimal policy (as currently viewed)

– exploration

● Goal: optimize rewards for a given time frame


Greedy Agent

Start with initial policy


Compute utilities (using ADP)
Optimize policy
Go to Step 2
This very seldom converges to global optimal policy
Greedy in the Limit of Infinite Exploration

● Explore any action in any state unbounded number of times

– carry out optimal policy ⇒ maximize reward


Eventually has to become greedy

Simple strategy with probability p(1/t) take random action


– initially (t small) focus on exploration
– later (t big) focus on optimal policy

R+ is optimistic estimate, best possible award in any state


Extension of Adaptive Dynamic Programming

U(s) ← R(s) + γ maxa ∑ s ′ P(s ′ ∣s, a) U(s ′ )


● Previous definition of utility calculation

U + (s) ← R(s) + γ maxa f (∑ s ′ P(s ′ ∣s, a) U + (s ′ ), N(s, a))


● New utility calculation

● One possible definition of f(u, n)


f(u, n) = { R+
if n <Nc
u otherwise
LEARNING CURVE
Q-Learning
● Learning an action utility function Q(s, a)

Model-free: no explicit transition model P(s ′ ∣s, a)


● Allows computation of utilities U(s) = maxaQ(s, a)

● Theoretically correct Q values

Q(s, a) = R(s) + γ ∑ s ′ P(s ′ ∣s, a) maxa′Q(s ′ , a′ )

● Update formula inspired by temporal difference learning


∆Q(s, a) = α(R(s) + γ maxa′ Q(s ′ , a′ ) − Q(s, a))

● For our example, Q-learning slower, but successful applications (TD-GAMMON)


Generalization in reinforcement learning
Large Scale Reinforcement Learning

● Adaptive dynamic programming (ASP) scalable to maybe 10,000 states


– Backgammon has 1020 states
– Chess has 1040 states

● It is not possible to visit all these states multiple times ⇒ Generalization of states
needed
Function Approximation
● Define state utility function as linear combination of features Uˆ θ(s) =
θ1 f1(s) + θ2 f2(s) + ... + θnfn(s)
● Recall: features to assess Chess state
– f1(s) = (number of white pawns) – (number of black pawns)
– f2(s) = (number of white rooks) – (number of black rooks)
– f3(s) = (number of white queens) – (number of black queens)
– f4(s) = king safety
– f5(s) = good pawn position – etc.

⇒ Reduction from 1040 to, say, 20 parameters

Main benefit: ability to assess unseen states


Learning Feature Weights
● Example: 2 features: x and y
Uˆ θ(f1, f2) = θ0 + θ1f1 + θ2f2
● Current feature weights θ0, θ1, θ2 = (0.5, 0.2, 0.1)
● Model’s prediction of utility of specific state, e.g., Uˆ θ(1, 1) = 0.8
● Sample set of trials, found value uθ(1, 1) = 0.4
● Error Eθ = 1 2 (Uˆ θ(f1, f2) − uθ(f1, f2))2
● How do you update the weights θi?
Gradient Descent Training
Compute gradient of error
Update against gradient
dEθdθi = (Uˆ θ(f1, f2) − uθ(f1, f2)) fi

∆θi = −µ dEθdθi

● Our example
– ∆θ1 = −µ(Uˆ θ(f1, f2) − uθ(f1, f2)) fi = −µ(0.8 − 0.4) 1 = −0.4µ
– ∆θ2 = −µ(Uˆ θ(f1, f2) − uθ(f1, f2)) fi = −µ(0.8 − 0.4) 1 = −0.4µ

[Link] Remarks

⇒we may want to use more complex features


● If we know something about the problem

● Our toy example: utility related to Manhattan distance from goal (xgoal, ygoal)
f3(s) = (x − xgoal) + (y − ygoal)
● Gradient descent training can also be used for temporal distance learning
Policy Search
● Idea: directly optimize policy
● Policy may be parameterized Q functions, hence:
π(s) = argmaxaQˆ θ(s, a)
● Stochastic policy, e.g., given by softmax function
πθ(s, a) = 1 Zs e Qˆ θ (s,a)
● Policy value ρ(θ): expected reward if πθ is carried out

Hill climbing

⇒ optimizing policy value ρ(θ) may be done in closed form


● Deterministic policy, deterministic environment

⇒gradient descent by following policy gradient


● If ρ(θ) differentiable

⇒hillclimb if ρ(θ) improves


● Make small changes to parameters

More complex for stochastic environment


Summary
● Building on Markov decision processes and machine learning
● Passive reinforcement learning (fixed policy, partially observable
environment, stochastic outcomes of actions)
– sampling (carrying out trials)
– adaptive dynamic programming
– temporal difference learning

● Active reinforcement learning


– greedy in the limit of infinite exploration
– following optimal policy vs. exploration
– exploratory adaptive dynamic programming

● Generalization: representing utility function with small set of parameters


● Policy search

You might also like