0% found this document useful (0 votes)

12 views7 pages

Introduction to Reinforcement Learning

Reinforcement learning involves an agent learning to maximize rewards through interactions with an unknown environment, often modeled as a Markov decision problem. The document discusses passive reinforcement learning, where the agent follows a fixed policy to learn state utilities, and adaptive dynamic programming, which combines learning with dynamic programming to update state utilities based on observed transitions. It also covers temporal difference learning, which updates utilities using observed reinforcements, and emphasizes the importance of exploration in learning optimal policies.

Uploaded by

skaasc.csdept

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views7 pages

Introduction to Reinforcement Learning

Uploaded by

skaasc.csdept

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Reinforcement learning

In many domains it is difficult to give a precise evaluation function, which would allow
an agent to evaluate the effectiveness or correctness of her actions, except after she
has reached a terminal state. In the terminal state the agent by default obtains an
objective evaluation of her actions which is termed a reward, or a reinforcement. It
is convenient to state the agent’s task so that it would have to act in such a way to
maximize this reward or reinforcement. This is called reinforcement learning.

In a general case the agent may not have full information about her environment, as
well as a precise, or any, description of her actions. The situation of this agent is
similar to one of the possible statement of a full task of artificial intelligence — the
agent is placed in an environment she does not know, and cannot act in it. She has to
learn to act in this environment effectively, to maximize some criterion, available as
reinforcements.

We shall consider a probabilistic model of the agent’s action outcomes. In fact, we

shall assume that we are dealing with an unknown Markov decision problem (MDP).

Reinforcement learning — introduction 1 Reinforcement learning — introduction 2

Passive reinforcement learning Trials

Let us first consider passive reinforcement learning, where we assume that the Recall the 4 × 3 environment we studied earlier:
agent’s policy π(s) is fixed. Agent is therefore bound to do what the policy dictates,
although the outcomes of her actions are probabilistic. The agent may watch what is 3 +1
3 +1 0.812 0.868 0.918
happening, so she knows what states she is reaching and what rewards she gets there.

2 2 0.762 0.660 –1
−1
The agent’s jobs is to learn the utilities of the states U π (s), computed according to
the equation:
∞ 1 0.705 0.655 0.611 0.388
 

U π (s) = E  γ t R(st ) 1
X

t=0
1 2 3 4 1 2 3 4

For the 4x3 example environment used here for illustration we will assume γ = 1. The agent executes the trials where she performs actions according to the policy, until
reaching a terminal state, and receives percepts indicating both the current state and
the reinforcement. Example trials:

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

Reinforcement learning — passive 3 Reinforcement learning — passive 4

Direct utility determination By averaging over a large number of samples she can determine the subsequent
approximations of the expected value of state utilities, which converge in the infinity to
The objective of the agent is to compute the state utilities U π (s) generated by the the real expected values. This way the reinforcement learning task is reduced to
current policy π(s). The state utilities are defined as expected values of the sums of a simple inductive learning.
the (discounted) reinforcements received by the agent, who started from a given state,
and is acting according to her policy: This approach works, but is not very efficient, since it requires a large number of trials.
The problem is, that the algorithm, by using simple averaging, ignores important
∞ information contained in the trials, namely, that state utilities in neighboring states are
 

U π (s) = E  γ t R(st )
X

t=0 related.
The agent may approximate the above quantity using the trials by computing the
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
reward-to-go for each state visited along the way. At the end of each trial the agent
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
takes the reward obtained in that state to be its utility, and then, backtracking along
the trial, computes the reward-to-go for subsequent states, by summing up the rewards (1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

obtained in the last section of the trial.1

For example, for the trial: For example, in the second trial of the previous example, the algorithm evaluates the
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1 utility of state (3,2) as the reward-to-go only from this trial, but ignores the fact, that
the successor state in this state is (3,3), which has an already known, significantly
we get Rtg (4, 3) = 1, Rtg (3, 3) = 0.96, Rtg (2, 3) = 0.92, Rtg (1, 3) = 0.88, Rtg (1, 2) = higher utility. The Bellman equation relates the utilities of successor states, but this
0.84, Rtg (1, 3) = 0.80, Rtg (1, 2) = 0.76, Rtg (1, 1) = 0.72 approach cannot take advantage of this.
1 In case of discounting coefficient γ 6= 1.0 the rewards should be properly discounted.

Reinforcement learning — passive 5 Reinforcement learning — passive 6

Adaptive dynamic programming Adaptive dynamic programming — the algorithm
The adaptive dynamic programming (ADP) is a process similar to the dynamic
programming, combined with learning the model of the environment, ie. the state
transition probability distribution and the reward function. It works by counting the
transitions from the state-action pairs to the next states. The trials provide the
training data of such transitions. The agent can compute the probability of the
transitions as frequencies of their occurrences in the trials.

For example, in the presented trials, the action → (Right) was executed three times
in the state (1,3). Two of these times the successor state was (2,3), so the agent
should compute P ((2, 3)|(1, 3), Right) = 32 .
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

After executing every single action the agent updates the state utilities by solving the
(simplified) Bellman equation using one of the appropriate methods. The equation is
simplified because we only know the distribution of the effects of actions belonging to
the policy, and we cannot compute the best action for each state. Since we are
computing the U π we take just these actions.

Reinforcement learning — ADP method 7 Reinforcement learning — ADP method 8

Adaptive dynamic programming — efficiency

The ADP algorithm updates the utility values as best as it is possible, and in this
respect it is a standard that the other algorithms can be compared to. The policy
computation procedure, which solves a system of linear equations, can be intractable
for problems with large state spaces (eg. for the backgammon game we get 1050
equations with 1050 unknowns).
0.6
1 (4,3)
(3,3) 0.5
RMS error in utility

0.8 (1,3)
Utility estimates

0.4
(1,1)
0.6 (3,2) 0.3
0.4 0.2

0.2 0.1

0 0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials

The above charts show the convergence for an example learning experiment for the
4 × 3 environment. In this experiment the first trial ending in the “bad” terminal state
first occurs around the 78th trial, which causes large updates to some utility values.

Reinforcement learning — ADP method 9 Reinforcement learning — ADP method 10

Temporal difference learning Temporal difference learning — the algorithm

Instead of solving the full equation system for each trial, it is possible to update the
utilities using the currently observed reinforcements. Such algorithm is called the
temporal difference (TD) method:

U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))

In this case the single utility value is updated from a single observed state transition,
instead of the expected value of all transitions. This is why we take this correction —
the difference between the utility of the move and the utility of the state — reduced by
a factor α < 1. This introduces small corrections after each move. The correction
converges to zero when the state utility becomes equal to the discounted utility of
a move.

Note that this method does not require having a model of the environment P (s′|s, a),
nor does it compute one.

Reinforcement learning — temporal difference method 11 Reinforcement learning — temporal difference method 12
Convergence of the temporal difference method The convergence of another example learning experiment for the 4 × 3 environment:
0.6
1 (4,3)
There is a close relationship and similarity between the ADP and TD algorithms. While (3,3) 0.5

RMS error in utility

the latter makes only local updates in utility values, their average values converge to 0.8 (1,3)

Utility estimates
(1,1) 0.4
the same values as for the ADP algorithm. 0.6
(2,1)
0.3
0.4 0.2
In case of learning with many transition examples, the frequencies of state occurrences
0.2 0.1
agrees with their probability distributions, and it can be proved that the utilities
converge to the correct values. 0 0
0 100 200 300 400 500 0 20 40 60 80 100
Number of trials Number of trials

For this to occur the learning parameter α should decrease with the number of
processed trials. More precisely, the subsequent values of this parameter should satisfy
the requirements:
∞
α(n) = ∞
X

n=1
and also: ∞
α2(n) < ∞
X

n=1

Reinforcement learning — temporal difference method 13 Reinforcement learning — temporal difference method 14

2
Active reinforcement learning

RMS error, policy loss

RMS error
1.5 Policy loss
What should do an agent, which does not have a predefined policy, or who would like
to find an optimal one?
1
3
First she should compute the complete transition model for all the actions. The ADP +1

0.5
algorithm for a passive agent can be used for that. After that, the optimal policy,
2 –1
satisfying the Bellman equation, can be determined, like for a regular Markov decision
process: 0 1
0 50 100 150 200 250 300 350 400 450 500
Number of trials 1 2 3 4

U (s) = R(s) + γ max P (s′|s, a)U (s′)

a
s′

The agent can use the value iteration or the policy iteration algorithm. Then, having The chart on the left shows the result of learning for a sample experiment. The agent
determined the optimal policy she could simply execute this policy. found a direct path to the [+1] solution in the 39th trial, but this was the worse path,
along the states: (2,1), (3,1), (3,2), (3,3). This has, however, determined the agent’s
computed optimal policy, on the right. It turns out to be typical in this environment,
But is this really what she should do?
that the agent only rarely arrives at the optimal policy preferring the “upper” path:
(1,2), (1,3), (2,3), (3,3).

Why is this happening?

Reinforcement learning — active 15 Reinforcement learning — active 16

Exploration The exploration policy

If the agent, while learning the rules of the world, performed an arbitrarily imposed In order to combine an effective world exploration with the exploitation of the acquired
policy, then she did not build a correct model of the world (P (s′|s, a)), because she did knowledge, an agent must have some exploration policy. This is necessary to ensure
not learn all her possibilities and the consequences of her possible actions. She did not that the agent is able to learn all available actions to the extent permitting her to
learn them because she did not perform them, because they were not part of the policy. calculate the globally optimal policy.

If, instead, the agent had not only followed the imposed policy, or after initially A simple exploration policy would be to execute some random actions in all states
following it, had not gone straight to calculating the optimal policy, but had performed some fraction of the time, and follow the optimal policy the rest of the time.
other actions for a while, she would have had a chance to learn more about her world.
This approach works, but is slow to converge. It would be better to prefer the
This is the trade-off between exploitation of existing knowledge and exploration of exploration of relatively unexplored state-action pairs, while at the same time avoiding
the environment. An agent should not too quickly accept the learned world model, and the pairs already better known and believed to be of low utility.
the calculated associated policy. If she wants to learn an accurate model of the world,
and then determine the truly optimal policy, she should try different possibilities.

Moreover, she must repeatedly try all the actions in all the states if she wants to avoid
having a random unlucky series prevent her from discovering some particularly good
move. However, eventually she will want to start executing the optimal policy in order
to reap the full benefits of the knowledge she has developed.

Reinforcement learning — exploration 17 Reinforcement learning — exploration 18

The exploration function The fact, that the update formula for U + in the right hand side also uses U + is
important. Since the states and actions surrounding the starting state will be executed
A reasonable exploration policy can be achieved by introducing the optimistic multiple times, if the agent used non-optimistic utilities for updating, she might get
approximation of utilities U +(s): discouraged from such states, and start to avoid “setting out”. Using U + instead
means, that the optimistic values generated around the newly explored regions will be
propagated back, and the actions leading to unknown regions will be favored.
 

U +(s) ← R(s) + γ max f  P (s′|s, a)U +(s′), N (a, s)

a
s′

where N (a, s) is the number of previous choices of action a in state s, and f (u, n) is
the exploration function, trading off the greed (large values of u) against curiosity
(small values of n).

Obviously the f function should be increasing with respect to u and decreasing with
respect to n. A simple definition of f can be:

R+ if n < Ne



f (u, n) = 


u otherwise

where the R+ denotes the optimistic estimate of the best reward possible to obtain in
any state, and the Ne is the minimal number of times that the agent will want to try
each state-action pair.

Reinforcement learning — exploration 19 Reinforcement learning — exploration 20

2.2
(1,1) 1.4
2 (1,2)
RMS error, policy loss

1.8 (1,3) 1.2 RMS error

(2,3) Policy loss
Utility estimates

1.6 (3,2) 1
(3,3)
1.4 (4,3) 0.8
1.2 0.6
1 0.4
0.8
0.2
0.6
0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials

The chart on the left shows the results of learning with exploration. The policy close to
optimal was reached after 18 trials. Note that the utility values converge more slowly
(RMS error) than the optimal policy is determined (policy loss denotes the maximum
loss from executing a suboptimal policy).

Reinforcement learning — exploration 21 Reinforcement learning — exploration 22

Active temporal difference (TD-)learning The Q-function

The method of temporal differences can also be applied to active learning. The agent An alternative to temporal difference learning of utilities is the Q-learning method,
might not have a fixed policy, and still calculate the state utilities using the same which learns an action-value function Q(s, a), which represents the expected utility
update formula as in the passive case: of taking a given action in a given state. The Q-function is related to the utility
function U with the formula:
U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))
U (s) = max
a
Q(s, a)

Given the computed utilities the agent can determine the optimal actions for each
state using the utilities of successor states. It can be proved, that the active TD agent Furthermore, the optimal policy can be extracted from the Q-function as follows:
will eventually arrive at the same resulting utility values as the active ADP agent.
π ∗(s) = argmax Q(s, a)
a

This represents a significant advantage of using — and learning — the Q-function

instead of the utilities U . Even having the full distribution of utilities, the agent must
also know the probability distribution P (s′|s, a) to compute the optimal action in each
state.
The Q-function gives this directly (almost, just by looking up the max).
The P (s′|s, a) distribution need not be learned or known.

Reinforcement learning — TD-learning 23 Reinforcement learning — Q-learning 24

A version of a Bellman equation can also be written for Q-values, deriving from the Q-learning — updating by temporal differences
fact that the expected total reward for taking an action is its immediated reward plus
the discounted utility of the outcome state: The above form of the Bellman equation can be used to derive the TD update formula
for the Q values, by analogy to the TD update for utilities:
′ ′
Q(s, a) = R(s) + γ P (s |s, a)U (s )
X

s′
Q(s, a) ← Q(s, a) + α(R(s) + γ max
′
Q(s′, a′) − Q(s, a))
a
and the equation with respect to the Q-values only is:
This update is calculated whenever action a is executed in state s leading to the result
state s′. Here the term (R(s) + γ maxa′ Q(s′, a′) − Q(s, a)) represents an error in the
Q(s, a) = R(s) + γ P (s′|s, a) max Q(s′, a′)
X

s′ ′ a known Q values and the update tries to minimize it.

The above equation can further be extended to the more general case of the reward We should once again note that the TD Q-learning agent does not need or learn the
function depending on the previous state and action taken therein, in addition to the transition model P (s′|s, a), so this represents a model-free approach to learning.
goal state, so the expected reward has to be computed, instead of taking just the fixed This method can be applied even in very complex domains, but on the other hand the
value: Q-learning agent has no way of looking into the future, so it may have difficulty when
rewards are sparse and long action sequences must be taken to reach them.
Q(s, a) = P (s′|s, a)[R(s, a, s′) + γ max Q(s′, a′)]
X

′ a
s′ Q-learning with temporal difference updating converges to the result much slower than
the ADP algorithm since it does not enforce computing the full consistency of the
model, which it does not have. The consequence it a much faster computation.

Reinforcement learning — Q-learning 25 Reinforcement learning — Q-learning 26

The full Q-learning algorithm with exploration

In general an active Q-learning agent requires exploration, just as it is in the case with
the ADP method. That is why the algorithm uses the exploration function f and the
action frequency table N . With a simpler exploration function (eg. executing some
fraction of random moves) the N table might not be necessary.

Reinforcement learning — Q-learning 27 Reinforcement learning — Q-learning 28

SARSA — State-Action-Reward-State-Action Q-learning is a more flexible algorithm, as it permits the agent to learn the optimal
behavior even if she currently executes a policy different from the best learned
There exists a variant of the Q-learning algorithm with a temporal difference update patterns. On the other hand, SARSA is more realistic, since, for example, if the agent
method called SARSA (State-Action-Reward-State-Action): was unable to fully control her policy, then it would be better for her to learn the
patterns corresponding to what will in fact happen to her, than to learn the best
possible patterns.
Q(s, a) ← Q(s, a) + α(R(s) + γQ(s′, a′) − Q(s, a))
SARSA updates take into account five elements: s, a, r, s′, a′. While the Q-learning Both Q-learning and SARSA are capable of learning the optimal policy for the 4x3
algorithm adjustments are based on the best action selected for the state reached by example environment, albeit more slowly than the ADP (in terms of the number of
the action a, SARSA considers which action was in fact selected. Therefore, eg. for iterations). This is because the local updates they make do not enforce the full
a greedy agent executing only exploitation, these two approaches would give the same consistency of the Q function.
result.

However, in case of exploration the difference is significant. Q-learning is an off-policy

learning algorithm, computing the best possible Q values, regardless of where the
current policy makes the agent go. On the other hand, SARSA is an on-policy
algorithm, which is more appropriate for the agent acting according to the policy.

Reinforcement learning — Q-learning 29 Reinforcement learning — Q-learning 30

Generalization in reinforcement learning

The above reinforcement learning algorithms assume an explicit representation of the

U (s) or Q(s) function, such as, eg. table representation. This may be practical only up
to some size of the problem.

For example, for problems with a very large number of states (eg. ≫ 1020 for games
such as chess or backgammon), is is hard to envision executing a sufficient number of
trials to visit each state frequently enough. It is necessary to use some generalization
method, which would permit to determine an effective policy based on a small part of
the state space explored.

Reinforcement learning — generalization 31 Reinforcement learning — generalization 32

Function approximation The main idea of this approach is not an approximation, using fewer coefficients, of
a function, which in fact requires many more of them, but the generalization. This is,
One of such methods is the function approximation, based on the representation of we want to generate a policy valid for all the states based on the analysis of a small
the function under examination (like U or Q) in some nontabular form, like a finite fraction of them.
formula. Similarly as was the case with the heuristic functions, some linear
combination of some features (also called the state attributes) can be used: For examples, in the experiments with the game backgammon, it was possible to train
a player to a level of play comparable to human players based on examining one in 1012
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s) states.

The reinforcement learning algorithm would learn the vector of coefficients Obviously, the success in reinforcement learning in such cases depends on the correct
θ = hθ1, θ2, ..., θni so that the evaluation function Ûθ would closely enough selection of the approximation function. If no combination of the selected features can
approximate the state utility function. And this provides a very compact representation give a good strategy for a game, then no method of learning the coefficients will lead
of the utility U (or Q) function for even very large state spaces. to one. On the other hand, selecting a very elaborate function, with a large number of
features and coefficients, increases the chance for a success, but at the expense of
a slower convergence and, consequently, the learning process.
This approach is called a function approximation because there is no proof that the
real evaluation function can be expressed by this kind of a formula. However, while it
seems doubtful that, for example, the optimal policy for chess can be expressed by
a function with just a few coefficients, it is entirely possible that a good level of
playing can be achieved this way.

Reinforcement learning — function approximation 33 Reinforcement learning — function approximation 34

Approximating direct utility estimation The rate of change of this error with respect to the parameter θi can be written as
∂Ej /∂θi, so in order to adjust this parameter toward decreasing the error, the proper
We can try to apply the idea of function generalization to the method of direct utility adjustment formula may be, where α is the usual learning coefficient:
estimation. Previously, we used this method directly on state utilities. Now we want to
convert to the parametric representation of the utilities: ∂Ej (s) ∂ Ûθ (s)
θi ← θi − α = θi + α(uj (s) − Ûθ (s))
∂θi ∂θi
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s)
Given a collection of trials, we obtain a set of sample values approximating Ûθ (s) and The above formula is known as the Widrow-Hoff or the delta rule.
a set of equations from the above formula, which can be solved for the optimal values
of hθ1, θ2, ..., θni in the sense of minimizing the squared error, for example, using the
linear regression.

Alternatively, we may try to use on-line learning, with some way of updating the
parameters based on the reinforcements obtained after each trial (or each step). We
want to express an update formula for each of the approximation parameters θi based
on the observed rewards. For example, if uj (s) is the reward-to-go for state s in j-th
trial, then the utility function approximation error can be computed as:

(Ûθ (s) − uj (s))2

Ej =
2

Reinforcement learning — function approximation 35 Reinforcement learning — function approximation 36

Approximating direct utility estimation — an example
As an example, for the 4x3 environment the state utility function could be
approximated using a linear combination of the coordinates:

Ûθ (x, y) = θ0 + θ1x + θ2y

According to the delta rule the corrections will be given by:

θ0 ← θ0 + α(uj (s) − Ûθ (s))

θ1 ← θ1 + α(uj (s) − Ûθ (s))x
θ2 ← θ2 + α(uj (s) − Ûθ (s))y

Assuming for an example θ = hθ0, θ1, θ2i = h0.5, 0.2, 0.1i we get the initial
approximation Ûθ (1, 1) = 0.8. If, after executing a trial, we computed eg.
uj (1, 1) = 0.72, then all the coefficients θ0, θ1, θ2 would be reduced by 0.08α, which in
turn would decrease the error for the state (1,1). Obviously, all the values of Ûθ (s)
would then change, which is the idea of generalization.

Reinforcement learning — function approximation 37 Reinforcement learning — function approximation 38

Approximating temporal-difference learning

It is also possible to apply the idea of parametric representation to the
temporal-difference learning. As in the previous approach, we need a formula to adjust
the representation parameters θ = hθ1, θ2, ..., θni to reduce the approximation error.
This time, instead of computing this error based on the reward-to-go values, we try to
reduce the temporal difference between successive states after each step.
The formula for TD-learning:

∂ Ûθ (s)
θi ← θi + α[R(s) + γ Ûθ (s′) − Ûθ (s)]
∂θi

and for Q-learning:

∂ Q̂θ (s, a)
θi ← θi + α[R(s) + γ max Q̂θ (s′, a′) − Q̂θ (s, a)]
′ a ∂θi

For passive TD learning, these update rules can be shown to converge to the closest
possible approximation to the true function when the function approximator is linear in
its features. But with active learning and non-linear approximation functions in many
cases the parameters do not converge, even when good solutions exist.

Reinforcement learning — function approximation 39 Reinforcement learning — function approximation 40

Deep reinforcement learning Useful resources

In many cases it is difficult to come up sufficiently good features for an acceptable Stuart J. Russell, Peter Norvig: Artificial Intelligence: A Modern Approach (Fourth
linear approximation of U or Q. Sometimes it is easier to develop a complex, Edition), Prentice-Hall, 2020
non-linear function, which works better. [Link]

In this role, neural networks can be used, and particularly deep neural networks. They David Silver: Reinforcement Learning course (2015, University College London):
can be used on data such as raw images, with no feature extraction at all. The deep [Link]
neural network is able to discover the useful features. A reinforcement learning system
that uses a deep network as a function approximator is called a deep reinforcement
learning system.

The deep neural network is a function parametrized by θ, elements of which are all the
parameters of the network (all the weights and biases in all the layers of the network).
The feed-forward neural network updates these parameters in the process called
backpropagation, which is their update rule.

Reinforcement learning — function approximation 41 Reinforcement learning — function approximation 42

Reinforcement Learning Concepts Explained
No ratings yet
Reinforcement Learning Concepts Explained
16 pages
Reinforcement Learning Techniques Overview
No ratings yet
Reinforcement Learning Techniques Overview
33 pages
Direct Utility Estimation in Reinforcement Learning
No ratings yet
Direct Utility Estimation in Reinforcement Learning
32 pages
Understanding Reinforcement Learning
No ratings yet
Understanding Reinforcement Learning
30 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
30 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
20 pages
Reinforcement Learning & NLP Overview
No ratings yet
Reinforcement Learning & NLP Overview
22 pages
Reinforcement Learning Overview and Techniques
No ratings yet
Reinforcement Learning Overview and Techniques
5 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
22 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
59 pages
Passive Reinforcement Learning Overview
No ratings yet
Passive Reinforcement Learning Overview
43 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
22 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
4 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
80 pages
Reinforcement Learning Fundamentals
No ratings yet
Reinforcement Learning Fundamentals
58 pages
Dynamic Programming in RL Planning
No ratings yet
Dynamic Programming in RL Planning
17 pages
Deep Reinforcement Learning Overview
No ratings yet
Deep Reinforcement Learning Overview
6 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
61 pages
Reinforcement Learning in MDPs Explained
No ratings yet
Reinforcement Learning in MDPs Explained
7 pages
Passive vs Active Reinforcement Learning
No ratings yet
Passive vs Active Reinforcement Learning
15 pages
Independent Q-Learning in RL
No ratings yet
Independent Q-Learning in RL
50 pages
Deep Learning Unit5
No ratings yet
Deep Learning Unit5
12 pages
Reinforcement Learning Fundamentals
No ratings yet
Reinforcement Learning Fundamentals
53 pages
Reinforcement Learning Explained
No ratings yet
Reinforcement Learning Explained
45 pages
Reinforcement Learning Reinforcement Learning
No ratings yet
Reinforcement Learning Reinforcement Learning
24 pages
Reinforcement Learning Algorithms Overview
No ratings yet
Reinforcement Learning Algorithms Overview
42 pages
MonteCarlo RL
No ratings yet
MonteCarlo RL
50 pages
Reinforcement Learning Overview
No ratings yet
Reinforcement Learning Overview
57 pages
Reinforcement Learning Fundamentals
No ratings yet
Reinforcement Learning Fundamentals
42 pages
Unit-IV Passive Reinforcement Learning
No ratings yet
Unit-IV Passive Reinforcement Learning
6 pages
Reinforcement Learning Detailed Solutions
No ratings yet
Reinforcement Learning Detailed Solutions
4 pages
Reinforcement Learning Cheatsheet
No ratings yet
Reinforcement Learning Cheatsheet
16 pages
Understanding Deep Reinforcement Learning
No ratings yet
Understanding Deep Reinforcement Learning
17 pages
Basics of Reinforcement Learning
No ratings yet
Basics of Reinforcement Learning
15 pages
Ai Unit-5 PDF
No ratings yet
Ai Unit-5 PDF
33 pages
Understanding Reinforcement Learning Basics
No ratings yet
Understanding Reinforcement Learning Basics
6 pages
Reinforcement Learning Overview
No ratings yet
Reinforcement Learning Overview
47 pages
UC Berkeley Reinforcement Learning Guide
No ratings yet
UC Berkeley Reinforcement Learning Guide
46 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
34 pages
CS188 Project 2: Multi-Agent Pacman
No ratings yet
CS188 Project 2: Multi-Agent Pacman
38 pages
Introduction to Reinforcement Learning
No ratings yet
Introduction to Reinforcement Learning
28 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
23 pages
Reinforcement Learning for Robot Walking
No ratings yet
Reinforcement Learning for Robot Walking
9 pages
L11 Reinforcement Learning
No ratings yet
L11 Reinforcement Learning
55 pages
Overview of Reinforcement Learning
No ratings yet
Overview of Reinforcement Learning
23 pages
Passive vs Active Reinforcement Learning
No ratings yet
Passive vs Active Reinforcement Learning
30 pages
Reinforcement Learning in Robotics
No ratings yet
Reinforcement Learning in Robotics
5 pages
Reinforcement Learning and Genetic Algorithms
100% (1)
Reinforcement Learning and Genetic Algorithms
24 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
9 pages
Overview of Reinforcement Learning
No ratings yet
Overview of Reinforcement Learning
45 pages
Reinforcement Learning Overview and Methods
No ratings yet
Reinforcement Learning Overview and Methods
23 pages
Active Learning in Inverse Reinforcement Learning
No ratings yet
Active Learning in Inverse Reinforcement Learning
16 pages
Unit 02 Reinforcement For Agentic AI
No ratings yet
Unit 02 Reinforcement For Agentic AI
56 pages
Beginner's Guide to Deep RL
No ratings yet
Beginner's Guide to Deep RL
23 pages
Understanding Reinforcement Learning Concepts
No ratings yet
Understanding Reinforcement Learning Concepts
58 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
29 pages
Reinforcement Learning Crash Course
No ratings yet
Reinforcement Learning Crash Course
84 pages
Code Switching Effects on Grade 10 Learning
No ratings yet
Code Switching Effects on Grade 10 Learning
46 pages
Management of Puerperal Sepsis Risks
No ratings yet
Management of Puerperal Sepsis Risks
23 pages
Sanskrit Research Journal Vol VIII Issue II
No ratings yet
Sanskrit Research Journal Vol VIII Issue II
16 pages
F5 English Language Pre-Mock Exam Guide
No ratings yet
F5 English Language Pre-Mock Exam Guide
3 pages
Trial of Sergeant Pattinson: Anti-Nationalism
No ratings yet
Trial of Sergeant Pattinson: Anti-Nationalism
3 pages
Exploring Sound and Dance Elements in MAPEH
100% (1)
Exploring Sound and Dance Elements in MAPEH
12 pages
Anselm's Ontological Argument Explained
100% (1)
Anselm's Ontological Argument Explained
4 pages
Role of Restriction Enzymes in DNA Profiling
No ratings yet
Role of Restriction Enzymes in DNA Profiling
26 pages
Preposition (Practice Set)
No ratings yet
Preposition (Practice Set)
5 pages
The Greatness Playbook - Inner Freedom Edition
No ratings yet
The Greatness Playbook - Inner Freedom Edition
4 pages
Fast Metabolism Diet: A Complete Guide
100% (1)
Fast Metabolism Diet: A Complete Guide
31 pages
Understanding ERCP and Choledocholithiasis
No ratings yet
Understanding ERCP and Choledocholithiasis
4 pages
Vibration Monitoring Standards Overview
No ratings yet
Vibration Monitoring Standards Overview
2 pages
Residency Compliance in Varilla Case
No ratings yet
Residency Compliance in Varilla Case
2 pages
Goodbye Yellow Brick Road Sheet Music
No ratings yet
Goodbye Yellow Brick Road Sheet Music
3 pages
HOPE3-Lesson8 Final
No ratings yet
HOPE3-Lesson8 Final
11 pages
Functional Orthopedic Treatment for AOB
No ratings yet
Functional Orthopedic Treatment for AOB
10 pages
Predictive Models With Matlab
No ratings yet
Predictive Models With Matlab
275 pages
Quine vs. Hall on Ontological Commitment
No ratings yet
Quine vs. Hall on Ontological Commitment
7 pages
Managing Financial Institutions Effectively
No ratings yet
Managing Financial Institutions Effectively
22 pages
Overview of Microbiology and Microbes
No ratings yet
Overview of Microbiology and Microbes
3 pages
MV Underground Cable Failure in Portugal
No ratings yet
MV Underground Cable Failure in Portugal
5 pages
Class 11 Computer Science Assessment 2023-24
No ratings yet
Class 11 Computer Science Assessment 2023-24
5 pages
82kg Weight Loss Diet Plan Overview
No ratings yet
82kg Weight Loss Diet Plan Overview
2 pages
Abridged Karaite Prayer Guide
No ratings yet
Abridged Karaite Prayer Guide
11 pages
Kelas 7 Bahasa Inggris: Ujian Akhir
No ratings yet
Kelas 7 Bahasa Inggris: Ujian Akhir
6 pages
National Learning Camp 2025 Details
No ratings yet
National Learning Camp 2025 Details
20 pages
Ashley McCoy's Personality Insights
No ratings yet
Ashley McCoy's Personality Insights
11 pages
Crispy Shankarpali Sweet Treats
No ratings yet
Crispy Shankarpali Sweet Treats
2 pages
Development of Formula Student Electric Car Battery Design Procedure
No ratings yet
Development of Formula Student Electric Car Battery Design Procedure
5 pages

Introduction to Reinforcement Learning

Uploaded by

Introduction to Reinforcement Learning

Uploaded by

Reinforcement learning

We shall consider a probabilistic model of the agent’s action outcomes. In fact, we

Reinforcement learning — introduction 1 Reinforcement learning — introduction 2

Passive reinforcement learning Trials

Reinforcement learning — passive 3 Reinforcement learning — passive 4

obtained in the last section of the trial.1

Reinforcement learning — passive 5 Reinforcement learning — passive 6

Reinforcement learning — ADP method 7 Reinforcement learning — ADP method 8

Adaptive dynamic programming — efficiency

Reinforcement learning — ADP method 9 Reinforcement learning — ADP method 10

Temporal difference learning Temporal difference learning — the algorithm

U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))

RMS error in utility

RMS error, policy loss

U (s) = R(s) + γ max P (s′|s, a)U (s′)

Why is this happening?

Reinforcement learning — active 15 Reinforcement learning — active 16

Exploration The exploration policy

Reinforcement learning — exploration 17 Reinforcement learning — exploration 18

U +(s) ← R(s) + γ max f  P (s′|s, a)U +(s′), N (a, s)

Reinforcement learning — exploration 19 Reinforcement learning — exploration 20

1.8 (1,3) 1.2 RMS error

Reinforcement learning — exploration 21 Reinforcement learning — exploration 22

Active temporal difference (TD-)learning The Q-function

This represents a significant advantage of using — and learning — the Q-function

Reinforcement learning — TD-learning 23 Reinforcement learning — Q-learning 24

s′ ′ a known Q values and the update tries to minimize it.

Reinforcement learning — Q-learning 25 Reinforcement learning — Q-learning 26

The full Q-learning algorithm with exploration

Reinforcement learning — Q-learning 27 Reinforcement learning — Q-learning 28

However, in case of exploration the difference is significant. Q-learning is an off-policy

Reinforcement learning — Q-learning 29 Reinforcement learning — Q-learning 30

The above reinforcement learning algorithms assume an explicit representation of the

Reinforcement learning — generalization 31 Reinforcement learning — generalization 32

Reinforcement learning — function approximation 33 Reinforcement learning — function approximation 34

(Ûθ (s) − uj (s))2

Reinforcement learning — function approximation 35 Reinforcement learning — function approximation 36

Ûθ (x, y) = θ0 + θ1x + θ2y

According to the delta rule the corrections will be given by:

θ0 ← θ0 + α(uj (s) − Ûθ (s))

Reinforcement learning — function approximation 37 Reinforcement learning — function approximation 38

Approximating temporal-difference learning

and for Q-learning:

Reinforcement learning — function approximation 39 Reinforcement learning — function approximation 40

Deep reinforcement learning Useful resources

Reinforcement learning — function approximation 41 Reinforcement learning — function approximation 42

You might also like