0% found this document useful (0 votes)
12 views7 pages

Introduction to Reinforcement Learning

Reinforcement learning involves an agent learning to maximize rewards through interactions with an unknown environment, often modeled as a Markov decision problem. The document discusses passive reinforcement learning, where the agent follows a fixed policy to learn state utilities, and adaptive dynamic programming, which combines learning with dynamic programming to update state utilities based on observed transitions. It also covers temporal difference learning, which updates utilities using observed reinforcements, and emphasizes the importance of exploration in learning optimal policies.

Uploaded by

skaasc.csdept
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views7 pages

Introduction to Reinforcement Learning

Reinforcement learning involves an agent learning to maximize rewards through interactions with an unknown environment, often modeled as a Markov decision problem. The document discusses passive reinforcement learning, where the agent follows a fixed policy to learn state utilities, and adaptive dynamic programming, which combines learning with dynamic programming to update state utilities based on observed transitions. It also covers temporal difference learning, which updates utilities using observed reinforcements, and emphasizes the importance of exploration in learning optimal policies.

Uploaded by

skaasc.csdept
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Reinforcement learning

In many domains it is difficult to give a precise evaluation function, which would allow
an agent to evaluate the effectiveness or correctness of her actions, except after she
has reached a terminal state. In the terminal state the agent by default obtains an
objective evaluation of her actions which is termed a reward, or a reinforcement. It
is convenient to state the agent’s task so that it would have to act in such a way to
maximize this reward or reinforcement. This is called reinforcement learning.

In a general case the agent may not have full information about her environment, as
well as a precise, or any, description of her actions. The situation of this agent is
similar to one of the possible statement of a full task of artificial intelligence — the
agent is placed in an environment she does not know, and cannot act in it. She has to
learn to act in this environment effectively, to maximize some criterion, available as
reinforcements.

We shall consider a probabilistic model of the agent’s action outcomes. In fact, we


shall assume that we are dealing with an unknown Markov decision problem (MDP).

Reinforcement learning — introduction 1 Reinforcement learning — introduction 2

Passive reinforcement learning Trials

Let us first consider passive reinforcement learning, where we assume that the Recall the 4 × 3 environment we studied earlier:
agent’s policy π(s) is fixed. Agent is therefore bound to do what the policy dictates,
although the outcomes of her actions are probabilistic. The agent may watch what is 3 +1
3 +1 0.812 0.868 0.918
happening, so she knows what states she is reaching and what rewards she gets there.

2 2 0.762 0.660 –1
−1
The agent’s jobs is to learn the utilities of the states U π (s), computed according to
the equation:
∞ 1 0.705 0.655 0.611 0.388
 

U π (s) = E  γ t R(st ) 1
X

t=0
1 2 3 4 1 2 3 4

For the 4x3 example environment used here for illustration we will assume γ = 1. The agent executes the trials where she performs actions according to the policy, until
reaching a terminal state, and receives percepts indicating both the current state and
the reinforcement. Example trials:

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

Reinforcement learning — passive 3 Reinforcement learning — passive 4

Direct utility determination By averaging over a large number of samples she can determine the subsequent
approximations of the expected value of state utilities, which converge in the infinity to
The objective of the agent is to compute the state utilities U π (s) generated by the the real expected values. This way the reinforcement learning task is reduced to
current policy π(s). The state utilities are defined as expected values of the sums of a simple inductive learning.
the (discounted) reinforcements received by the agent, who started from a given state,
and is acting according to her policy: This approach works, but is not very efficient, since it requires a large number of trials.
The problem is, that the algorithm, by using simple averaging, ignores important
∞ information contained in the trials, namely, that state utilities in neighboring states are
 

U π (s) = E  γ t R(st )
X

t=0 related.
The agent may approximate the above quantity using the trials by computing the
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
reward-to-go for each state visited along the way. At the end of each trial the agent
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
takes the reward obtained in that state to be its utility, and then, backtracking along
the trial, computes the reward-to-go for subsequent states, by summing up the rewards (1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

obtained in the last section of the trial.1


For example, for the trial: For example, in the second trial of the previous example, the algorithm evaluates the
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1 utility of state (3,2) as the reward-to-go only from this trial, but ignores the fact, that
the successor state in this state is (3,3), which has an already known, significantly
we get Rtg (4, 3) = 1, Rtg (3, 3) = 0.96, Rtg (2, 3) = 0.92, Rtg (1, 3) = 0.88, Rtg (1, 2) = higher utility. The Bellman equation relates the utilities of successor states, but this
0.84, Rtg (1, 3) = 0.80, Rtg (1, 2) = 0.76, Rtg (1, 1) = 0.72 approach cannot take advantage of this.
1 In case of discounting coefficient γ 6= 1.0 the rewards should be properly discounted.

Reinforcement learning — passive 5 Reinforcement learning — passive 6


Adaptive dynamic programming Adaptive dynamic programming — the algorithm
The adaptive dynamic programming (ADP) is a process similar to the dynamic
programming, combined with learning the model of the environment, ie. the state
transition probability distribution and the reward function. It works by counting the
transitions from the state-action pairs to the next states. The trials provide the
training data of such transitions. The agent can compute the probability of the
transitions as frequencies of their occurrences in the trials.

For example, in the presented trials, the action → (Right) was executed three times
in the state (1,3). Two of these times the successor state was (2,3), so the agent
should compute P ((2, 3)|(1, 3), Right) = 32 .
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1

(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1

After executing every single action the agent updates the state utilities by solving the
(simplified) Bellman equation using one of the appropriate methods. The equation is
simplified because we only know the distribution of the effects of actions belonging to
the policy, and we cannot compute the best action for each state. Since we are
computing the U π we take just these actions.

Reinforcement learning — ADP method 7 Reinforcement learning — ADP method 8

Adaptive dynamic programming — efficiency


The ADP algorithm updates the utility values as best as it is possible, and in this
respect it is a standard that the other algorithms can be compared to. The policy
computation procedure, which solves a system of linear equations, can be intractable
for problems with large state spaces (eg. for the backgammon game we get 1050
equations with 1050 unknowns).
0.6
1 (4,3)
(3,3) 0.5
RMS error in utility

0.8 (1,3)
Utility estimates

0.4
(1,1)
0.6 (3,2) 0.3
0.4 0.2

0.2 0.1

0 0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials

The above charts show the convergence for an example learning experiment for the
4 × 3 environment. In this experiment the first trial ending in the “bad” terminal state
first occurs around the 78th trial, which causes large updates to some utility values.

Reinforcement learning — ADP method 9 Reinforcement learning — ADP method 10

Temporal difference learning Temporal difference learning — the algorithm

Instead of solving the full equation system for each trial, it is possible to update the
utilities using the currently observed reinforcements. Such algorithm is called the
temporal difference (TD) method:

U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))


In this case the single utility value is updated from a single observed state transition,
instead of the expected value of all transitions. This is why we take this correction —
the difference between the utility of the move and the utility of the state — reduced by
a factor α < 1. This introduces small corrections after each move. The correction
converges to zero when the state utility becomes equal to the discounted utility of
a move.

Note that this method does not require having a model of the environment P (s′|s, a),
nor does it compute one.

Reinforcement learning — temporal difference method 11 Reinforcement learning — temporal difference method 12
Convergence of the temporal difference method The convergence of another example learning experiment for the 4 × 3 environment:
0.6
1 (4,3)
There is a close relationship and similarity between the ADP and TD algorithms. While (3,3) 0.5

RMS error in utility


the latter makes only local updates in utility values, their average values converge to 0.8 (1,3)

Utility estimates
(1,1) 0.4
the same values as for the ADP algorithm. 0.6
(2,1)
0.3
0.4 0.2
In case of learning with many transition examples, the frequencies of state occurrences
0.2 0.1
agrees with their probability distributions, and it can be proved that the utilities
converge to the correct values. 0 0
0 100 200 300 400 500 0 20 40 60 80 100
Number of trials Number of trials

For this to occur the learning parameter α should decrease with the number of
processed trials. More precisely, the subsequent values of this parameter should satisfy
the requirements:

α(n) = ∞
X

n=1
and also: ∞
α2(n) < ∞
X

n=1

Reinforcement learning — temporal difference method 13 Reinforcement learning — temporal difference method 14

2
Active reinforcement learning

RMS error, policy loss


RMS error
1.5 Policy loss
What should do an agent, which does not have a predefined policy, or who would like
to find an optimal one?
1
3
First she should compute the complete transition model for all the actions. The ADP +1

0.5
algorithm for a passive agent can be used for that. After that, the optimal policy,
2 –1
satisfying the Bellman equation, can be determined, like for a regular Markov decision
process: 0 1
0 50 100 150 200 250 300 350 400 450 500
Number of trials 1 2 3 4

U (s) = R(s) + γ max P (s′|s, a)U (s′)


X

a
s′

The agent can use the value iteration or the policy iteration algorithm. Then, having The chart on the left shows the result of learning for a sample experiment. The agent
determined the optimal policy she could simply execute this policy. found a direct path to the [+1] solution in the 39th trial, but this was the worse path,
along the states: (2,1), (3,1), (3,2), (3,3). This has, however, determined the agent’s
computed optimal policy, on the right. It turns out to be typical in this environment,
But is this really what she should do?
that the agent only rarely arrives at the optimal policy preferring the “upper” path:
(1,2), (1,3), (2,3), (3,3).

Why is this happening?

Reinforcement learning — active 15 Reinforcement learning — active 16

Exploration The exploration policy

If the agent, while learning the rules of the world, performed an arbitrarily imposed In order to combine an effective world exploration with the exploitation of the acquired
policy, then she did not build a correct model of the world (P (s′|s, a)), because she did knowledge, an agent must have some exploration policy. This is necessary to ensure
not learn all her possibilities and the consequences of her possible actions. She did not that the agent is able to learn all available actions to the extent permitting her to
learn them because she did not perform them, because they were not part of the policy. calculate the globally optimal policy.

If, instead, the agent had not only followed the imposed policy, or after initially A simple exploration policy would be to execute some random actions in all states
following it, had not gone straight to calculating the optimal policy, but had performed some fraction of the time, and follow the optimal policy the rest of the time.
other actions for a while, she would have had a chance to learn more about her world.
This approach works, but is slow to converge. It would be better to prefer the
This is the trade-off between exploitation of existing knowledge and exploration of exploration of relatively unexplored state-action pairs, while at the same time avoiding
the environment. An agent should not too quickly accept the learned world model, and the pairs already better known and believed to be of low utility.
the calculated associated policy. If she wants to learn an accurate model of the world,
and then determine the truly optimal policy, she should try different possibilities.

Moreover, she must repeatedly try all the actions in all the states if she wants to avoid
having a random unlucky series prevent her from discovering some particularly good
move. However, eventually she will want to start executing the optimal policy in order
to reap the full benefits of the knowledge she has developed.

Reinforcement learning — exploration 17 Reinforcement learning — exploration 18


The exploration function The fact, that the update formula for U + in the right hand side also uses U + is
important. Since the states and actions surrounding the starting state will be executed
A reasonable exploration policy can be achieved by introducing the optimistic multiple times, if the agent used non-optimistic utilities for updating, she might get
approximation of utilities U +(s): discouraged from such states, and start to avoid “setting out”. Using U + instead
means, that the optimistic values generated around the newly explored regions will be
propagated back, and the actions leading to unknown regions will be favored.
 

U +(s) ← R(s) + γ max f  P (s′|s, a)U +(s′), N (a, s)


X

a
s′

where N (a, s) is the number of previous choices of action a in state s, and f (u, n) is
the exploration function, trading off the greed (large values of u) against curiosity
(small values of n).

Obviously the f function should be increasing with respect to u and decreasing with
respect to n. A simple definition of f can be:

R+ if n < Ne



f (u, n) = 

u otherwise

where the R+ denotes the optimistic estimate of the best reward possible to obtain in
any state, and the Ne is the minimal number of times that the agent will want to try
each state-action pair.

Reinforcement learning — exploration 19 Reinforcement learning — exploration 20

2.2
(1,1) 1.4
2 (1,2)
RMS error, policy loss

1.8 (1,3) 1.2 RMS error


(2,3) Policy loss
Utility estimates

1.6 (3,2) 1
(3,3)
1.4 (4,3) 0.8
1.2 0.6
1 0.4
0.8
0.2
0.6
0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials

The chart on the left shows the results of learning with exploration. The policy close to
optimal was reached after 18 trials. Note that the utility values converge more slowly
(RMS error) than the optimal policy is determined (policy loss denotes the maximum
loss from executing a suboptimal policy).

Reinforcement learning — exploration 21 Reinforcement learning — exploration 22

Active temporal difference (TD-)learning The Q-function

The method of temporal differences can also be applied to active learning. The agent An alternative to temporal difference learning of utilities is the Q-learning method,
might not have a fixed policy, and still calculate the state utilities using the same which learns an action-value function Q(s, a), which represents the expected utility
update formula as in the passive case: of taking a given action in a given state. The Q-function is related to the utility
function U with the formula:
U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))
U (s) = max
a
Q(s, a)

Given the computed utilities the agent can determine the optimal actions for each
state using the utilities of successor states. It can be proved, that the active TD agent Furthermore, the optimal policy can be extracted from the Q-function as follows:
will eventually arrive at the same resulting utility values as the active ADP agent.
π ∗(s) = argmax Q(s, a)
a

This represents a significant advantage of using — and learning — the Q-function


instead of the utilities U . Even having the full distribution of utilities, the agent must
also know the probability distribution P (s′|s, a) to compute the optimal action in each
state.
The Q-function gives this directly (almost, just by looking up the max).
The P (s′|s, a) distribution need not be learned or known.

Reinforcement learning — TD-learning 23 Reinforcement learning — Q-learning 24


A version of a Bellman equation can also be written for Q-values, deriving from the Q-learning — updating by temporal differences
fact that the expected total reward for taking an action is its immediated reward plus
the discounted utility of the outcome state: The above form of the Bellman equation can be used to derive the TD update formula
for the Q values, by analogy to the TD update for utilities:
′ ′
Q(s, a) = R(s) + γ P (s |s, a)U (s )
X

s′
Q(s, a) ← Q(s, a) + α(R(s) + γ max

Q(s′, a′) − Q(s, a))
a
and the equation with respect to the Q-values only is:
This update is calculated whenever action a is executed in state s leading to the result
state s′. Here the term (R(s) + γ maxa′ Q(s′, a′) − Q(s, a)) represents an error in the
Q(s, a) = R(s) + γ P (s′|s, a) max Q(s′, a′)
X

s′ ′ a known Q values and the update tries to minimize it.

The above equation can further be extended to the more general case of the reward We should once again note that the TD Q-learning agent does not need or learn the
function depending on the previous state and action taken therein, in addition to the transition model P (s′|s, a), so this represents a model-free approach to learning.
goal state, so the expected reward has to be computed, instead of taking just the fixed This method can be applied even in very complex domains, but on the other hand the
value: Q-learning agent has no way of looking into the future, so it may have difficulty when
rewards are sparse and long action sequences must be taken to reach them.
Q(s, a) = P (s′|s, a)[R(s, a, s′) + γ max Q(s′, a′)]
X

′ a
s′ Q-learning with temporal difference updating converges to the result much slower than
the ADP algorithm since it does not enforce computing the full consistency of the
model, which it does not have. The consequence it a much faster computation.

Reinforcement learning — Q-learning 25 Reinforcement learning — Q-learning 26

The full Q-learning algorithm with exploration

In general an active Q-learning agent requires exploration, just as it is in the case with
the ADP method. That is why the algorithm uses the exploration function f and the
action frequency table N . With a simpler exploration function (eg. executing some
fraction of random moves) the N table might not be necessary.

Reinforcement learning — Q-learning 27 Reinforcement learning — Q-learning 28

SARSA — State-Action-Reward-State-Action Q-learning is a more flexible algorithm, as it permits the agent to learn the optimal
behavior even if she currently executes a policy different from the best learned
There exists a variant of the Q-learning algorithm with a temporal difference update patterns. On the other hand, SARSA is more realistic, since, for example, if the agent
method called SARSA (State-Action-Reward-State-Action): was unable to fully control her policy, then it would be better for her to learn the
patterns corresponding to what will in fact happen to her, than to learn the best
possible patterns.
Q(s, a) ← Q(s, a) + α(R(s) + γQ(s′, a′) − Q(s, a))
SARSA updates take into account five elements: s, a, r, s′, a′. While the Q-learning Both Q-learning and SARSA are capable of learning the optimal policy for the 4x3
algorithm adjustments are based on the best action selected for the state reached by example environment, albeit more slowly than the ADP (in terms of the number of
the action a, SARSA considers which action was in fact selected. Therefore, eg. for iterations). This is because the local updates they make do not enforce the full
a greedy agent executing only exploitation, these two approaches would give the same consistency of the Q function.
result.

However, in case of exploration the difference is significant. Q-learning is an off-policy


learning algorithm, computing the best possible Q values, regardless of where the
current policy makes the agent go. On the other hand, SARSA is an on-policy
algorithm, which is more appropriate for the agent acting according to the policy.

Reinforcement learning — Q-learning 29 Reinforcement learning — Q-learning 30


Generalization in reinforcement learning

The above reinforcement learning algorithms assume an explicit representation of the


U (s) or Q(s) function, such as, eg. table representation. This may be practical only up
to some size of the problem.

For example, for problems with a very large number of states (eg. ≫ 1020 for games
such as chess or backgammon), is is hard to envision executing a sufficient number of
trials to visit each state frequently enough. It is necessary to use some generalization
method, which would permit to determine an effective policy based on a small part of
the state space explored.

Reinforcement learning — generalization 31 Reinforcement learning — generalization 32

Function approximation The main idea of this approach is not an approximation, using fewer coefficients, of
a function, which in fact requires many more of them, but the generalization. This is,
One of such methods is the function approximation, based on the representation of we want to generate a policy valid for all the states based on the analysis of a small
the function under examination (like U or Q) in some nontabular form, like a finite fraction of them.
formula. Similarly as was the case with the heuristic functions, some linear
combination of some features (also called the state attributes) can be used: For examples, in the experiments with the game backgammon, it was possible to train
a player to a level of play comparable to human players based on examining one in 1012
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s) states.

The reinforcement learning algorithm would learn the vector of coefficients Obviously, the success in reinforcement learning in such cases depends on the correct
θ = hθ1, θ2, ..., θni so that the evaluation function Ûθ would closely enough selection of the approximation function. If no combination of the selected features can
approximate the state utility function. And this provides a very compact representation give a good strategy for a game, then no method of learning the coefficients will lead
of the utility U (or Q) function for even very large state spaces. to one. On the other hand, selecting a very elaborate function, with a large number of
features and coefficients, increases the chance for a success, but at the expense of
a slower convergence and, consequently, the learning process.
This approach is called a function approximation because there is no proof that the
real evaluation function can be expressed by this kind of a formula. However, while it
seems doubtful that, for example, the optimal policy for chess can be expressed by
a function with just a few coefficients, it is entirely possible that a good level of
playing can be achieved this way.

Reinforcement learning — function approximation 33 Reinforcement learning — function approximation 34

Approximating direct utility estimation The rate of change of this error with respect to the parameter θi can be written as
∂Ej /∂θi, so in order to adjust this parameter toward decreasing the error, the proper
We can try to apply the idea of function generalization to the method of direct utility adjustment formula may be, where α is the usual learning coefficient:
estimation. Previously, we used this method directly on state utilities. Now we want to
convert to the parametric representation of the utilities: ∂Ej (s) ∂ Ûθ (s)
θi ← θi − α = θi + α(uj (s) − Ûθ (s))
∂θi ∂θi
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s)
Given a collection of trials, we obtain a set of sample values approximating Ûθ (s) and The above formula is known as the Widrow-Hoff or the delta rule.
a set of equations from the above formula, which can be solved for the optimal values
of hθ1, θ2, ..., θni in the sense of minimizing the squared error, for example, using the
linear regression.

Alternatively, we may try to use on-line learning, with some way of updating the
parameters based on the reinforcements obtained after each trial (or each step). We
want to express an update formula for each of the approximation parameters θi based
on the observed rewards. For example, if uj (s) is the reward-to-go for state s in j-th
trial, then the utility function approximation error can be computed as:

(Ûθ (s) − uj (s))2


Ej =
2

Reinforcement learning — function approximation 35 Reinforcement learning — function approximation 36


Approximating direct utility estimation — an example
As an example, for the 4x3 environment the state utility function could be
approximated using a linear combination of the coordinates:

Ûθ (x, y) = θ0 + θ1x + θ2y

According to the delta rule the corrections will be given by:

θ0 ← θ0 + α(uj (s) − Ûθ (s))


θ1 ← θ1 + α(uj (s) − Ûθ (s))x
θ2 ← θ2 + α(uj (s) − Ûθ (s))y

Assuming for an example θ = hθ0, θ1, θ2i = h0.5, 0.2, 0.1i we get the initial
approximation Ûθ (1, 1) = 0.8. If, after executing a trial, we computed eg.
uj (1, 1) = 0.72, then all the coefficients θ0, θ1, θ2 would be reduced by 0.08α, which in
turn would decrease the error for the state (1,1). Obviously, all the values of Ûθ (s)
would then change, which is the idea of generalization.

Reinforcement learning — function approximation 37 Reinforcement learning — function approximation 38

Approximating temporal-difference learning


It is also possible to apply the idea of parametric representation to the
temporal-difference learning. As in the previous approach, we need a formula to adjust
the representation parameters θ = hθ1, θ2, ..., θni to reduce the approximation error.
This time, instead of computing this error based on the reward-to-go values, we try to
reduce the temporal difference between successive states after each step.
The formula for TD-learning:

∂ Ûθ (s)
θi ← θi + α[R(s) + γ Ûθ (s′) − Ûθ (s)]
∂θi

and for Q-learning:

∂ Q̂θ (s, a)
θi ← θi + α[R(s) + γ max Q̂θ (s′, a′) − Q̂θ (s, a)]
′ a ∂θi

For passive TD learning, these update rules can be shown to converge to the closest
possible approximation to the true function when the function approximator is linear in
its features. But with active learning and non-linear approximation functions in many
cases the parameters do not converge, even when good solutions exist.

Reinforcement learning — function approximation 39 Reinforcement learning — function approximation 40

Deep reinforcement learning Useful resources

In many cases it is difficult to come up sufficiently good features for an acceptable Stuart J. Russell, Peter Norvig: Artificial Intelligence: A Modern Approach (Fourth
linear approximation of U or Q. Sometimes it is easier to develop a complex, Edition), Prentice-Hall, 2020
non-linear function, which works better. [Link]

In this role, neural networks can be used, and particularly deep neural networks. They David Silver: Reinforcement Learning course (2015, University College London):
can be used on data such as raw images, with no feature extraction at all. The deep [Link]
neural network is able to discover the useful features. A reinforcement learning system
that uses a deep network as a function approximator is called a deep reinforcement
learning system.

The deep neural network is a function parametrized by θ, elements of which are all the
parameters of the network (all the weights and biases in all the layers of the network).
The feed-forward neural network updates these parameters in the process called
backpropagation, which is their update rule.

Reinforcement learning — function approximation 41 Reinforcement learning — function approximation 42

You might also like