Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
In many domains it is difficult to give a precise evaluation function, which would allow
an agent to evaluate the effectiveness or correctness of her actions, except after she
has reached a terminal state. In the terminal state the agent by default obtains an
objective evaluation of her actions which is termed a reward, or a reinforcement. It
is convenient to state the agent’s task so that it would have to act in such a way to
maximize this reward or reinforcement. This is called reinforcement learning.
In a general case the agent may not have full information about her environment, as
well as a precise, or any, description of her actions. The situation of this agent is
similar to one of the possible statement of a full task of artificial intelligence — the
agent is placed in an environment she does not know, and cannot act in it. She has to
learn to act in this environment effectively, to maximize some criterion, available as
reinforcements.
Let us first consider passive reinforcement learning, where we assume that the Recall the 4 × 3 environment we studied earlier:
agent’s policy π(s) is fixed. Agent is therefore bound to do what the policy dictates,
although the outcomes of her actions are probabilistic. The agent may watch what is 3 +1
3 +1 0.812 0.868 0.918
happening, so she knows what states she is reaching and what rewards she gets there.
2 2 0.762 0.660 –1
−1
The agent’s jobs is to learn the utilities of the states U π (s), computed according to
the equation:
∞ 1 0.705 0.655 0.611 0.388
U π (s) = E γ t R(st ) 1
X
t=0
1 2 3 4 1 2 3 4
For the 4x3 example environment used here for illustration we will assume γ = 1. The agent executes the trials where she performs actions according to the policy, until
reaching a terminal state, and receives percepts indicating both the current state and
the reinforcement. Example trials:
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1
Direct utility determination By averaging over a large number of samples she can determine the subsequent
approximations of the expected value of state utilities, which converge in the infinity to
The objective of the agent is to compute the state utilities U π (s) generated by the the real expected values. This way the reinforcement learning task is reduced to
current policy π(s). The state utilities are defined as expected values of the sums of a simple inductive learning.
the (discounted) reinforcements received by the agent, who started from a given state,
and is acting according to her policy: This approach works, but is not very efficient, since it requires a large number of trials.
The problem is, that the algorithm, by using simple averaging, ignores important
∞ information contained in the trials, namely, that state utilities in neighboring states are
U π (s) = E γ t R(st )
X
t=0 related.
The agent may approximate the above quantity using the trials by computing the
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
reward-to-go for each state visited along the way. At the end of each trial the agent
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
takes the reward obtained in that state to be its utility, and then, backtracking along
the trial, computes the reward-to-go for subsequent states, by summing up the rewards (1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1
For example, in the presented trials, the action → (Right) was executed three times
in the state (1,3). Two of these times the successor state was (2,3), so the agent
should compute P ((2, 3)|(1, 3), Right) = 32 .
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
(1, 1)−0.04 ❀ (1, 2)−0.04 ❀ (1, 3)−0.04 ❀ (2, 3)−0.04 ❀ (3, 3)−0.04 ❀ (3, 2)−0.04 ❀ (3, 3)−0.04 ❀ (4, 3)+1
(1, 1)−0.04 ❀ (2, 1)−0.04 ❀ (3, 1)−0.04 ❀ (3, 2)−0.04 ❀ (4, 2)−1
After executing every single action the agent updates the state utilities by solving the
(simplified) Bellman equation using one of the appropriate methods. The equation is
simplified because we only know the distribution of the effects of actions belonging to
the policy, and we cannot compute the best action for each state. Since we are
computing the U π we take just these actions.
0.8 (1,3)
Utility estimates
0.4
(1,1)
0.6 (3,2) 0.3
0.4 0.2
0.2 0.1
0 0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials
The above charts show the convergence for an example learning experiment for the
4 × 3 environment. In this experiment the first trial ending in the “bad” terminal state
first occurs around the 78th trial, which causes large updates to some utility values.
Instead of solving the full equation system for each trial, it is possible to update the
utilities using the currently observed reinforcements. Such algorithm is called the
temporal difference (TD) method:
Note that this method does not require having a model of the environment P (s′|s, a),
nor does it compute one.
Reinforcement learning — temporal difference method 11 Reinforcement learning — temporal difference method 12
Convergence of the temporal difference method The convergence of another example learning experiment for the 4 × 3 environment:
0.6
1 (4,3)
There is a close relationship and similarity between the ADP and TD algorithms. While (3,3) 0.5
Utility estimates
(1,1) 0.4
the same values as for the ADP algorithm. 0.6
(2,1)
0.3
0.4 0.2
In case of learning with many transition examples, the frequencies of state occurrences
0.2 0.1
agrees with their probability distributions, and it can be proved that the utilities
converge to the correct values. 0 0
0 100 200 300 400 500 0 20 40 60 80 100
Number of trials Number of trials
For this to occur the learning parameter α should decrease with the number of
processed trials. More precisely, the subsequent values of this parameter should satisfy
the requirements:
∞
α(n) = ∞
X
n=1
and also: ∞
α2(n) < ∞
X
n=1
Reinforcement learning — temporal difference method 13 Reinforcement learning — temporal difference method 14
2
Active reinforcement learning
0.5
algorithm for a passive agent can be used for that. After that, the optimal policy,
2 –1
satisfying the Bellman equation, can be determined, like for a regular Markov decision
process: 0 1
0 50 100 150 200 250 300 350 400 450 500
Number of trials 1 2 3 4
a
s′
The agent can use the value iteration or the policy iteration algorithm. Then, having The chart on the left shows the result of learning for a sample experiment. The agent
determined the optimal policy she could simply execute this policy. found a direct path to the [+1] solution in the 39th trial, but this was the worse path,
along the states: (2,1), (3,1), (3,2), (3,3). This has, however, determined the agent’s
computed optimal policy, on the right. It turns out to be typical in this environment,
But is this really what she should do?
that the agent only rarely arrives at the optimal policy preferring the “upper” path:
(1,2), (1,3), (2,3), (3,3).
If the agent, while learning the rules of the world, performed an arbitrarily imposed In order to combine an effective world exploration with the exploitation of the acquired
policy, then she did not build a correct model of the world (P (s′|s, a)), because she did knowledge, an agent must have some exploration policy. This is necessary to ensure
not learn all her possibilities and the consequences of her possible actions. She did not that the agent is able to learn all available actions to the extent permitting her to
learn them because she did not perform them, because they were not part of the policy. calculate the globally optimal policy.
If, instead, the agent had not only followed the imposed policy, or after initially A simple exploration policy would be to execute some random actions in all states
following it, had not gone straight to calculating the optimal policy, but had performed some fraction of the time, and follow the optimal policy the rest of the time.
other actions for a while, she would have had a chance to learn more about her world.
This approach works, but is slow to converge. It would be better to prefer the
This is the trade-off between exploitation of existing knowledge and exploration of exploration of relatively unexplored state-action pairs, while at the same time avoiding
the environment. An agent should not too quickly accept the learned world model, and the pairs already better known and believed to be of low utility.
the calculated associated policy. If she wants to learn an accurate model of the world,
and then determine the truly optimal policy, she should try different possibilities.
Moreover, she must repeatedly try all the actions in all the states if she wants to avoid
having a random unlucky series prevent her from discovering some particularly good
move. However, eventually she will want to start executing the optimal policy in order
to reap the full benefits of the knowledge she has developed.
a
s′
where N (a, s) is the number of previous choices of action a in state s, and f (u, n) is
the exploration function, trading off the greed (large values of u) against curiosity
(small values of n).
Obviously the f function should be increasing with respect to u and decreasing with
respect to n. A simple definition of f can be:
R+ if n < Ne
f (u, n) =
u otherwise
where the R+ denotes the optimistic estimate of the best reward possible to obtain in
any state, and the Ne is the minimal number of times that the agent will want to try
each state-action pair.
2.2
(1,1) 1.4
2 (1,2)
RMS error, policy loss
1.6 (3,2) 1
(3,3)
1.4 (4,3) 0.8
1.2 0.6
1 0.4
0.8
0.2
0.6
0
0 20 40 60 80 100 0 20 40 60 80 100
Number of trials Number of trials
The chart on the left shows the results of learning with exploration. The policy close to
optimal was reached after 18 trials. Note that the utility values converge more slowly
(RMS error) than the optimal policy is determined (policy loss denotes the maximum
loss from executing a suboptimal policy).
The method of temporal differences can also be applied to active learning. The agent An alternative to temporal difference learning of utilities is the Q-learning method,
might not have a fixed policy, and still calculate the state utilities using the same which learns an action-value function Q(s, a), which represents the expected utility
update formula as in the passive case: of taking a given action in a given state. The Q-function is related to the utility
function U with the formula:
U π (s) ← U π (s) + α(R(s) + γU π (s′) − U π (s))
U (s) = max
a
Q(s, a)
Given the computed utilities the agent can determine the optimal actions for each
state using the utilities of successor states. It can be proved, that the active TD agent Furthermore, the optimal policy can be extracted from the Q-function as follows:
will eventually arrive at the same resulting utility values as the active ADP agent.
π ∗(s) = argmax Q(s, a)
a
s′
Q(s, a) ← Q(s, a) + α(R(s) + γ max
′
Q(s′, a′) − Q(s, a))
a
and the equation with respect to the Q-values only is:
This update is calculated whenever action a is executed in state s leading to the result
state s′. Here the term (R(s) + γ maxa′ Q(s′, a′) − Q(s, a)) represents an error in the
Q(s, a) = R(s) + γ P (s′|s, a) max Q(s′, a′)
X
The above equation can further be extended to the more general case of the reward We should once again note that the TD Q-learning agent does not need or learn the
function depending on the previous state and action taken therein, in addition to the transition model P (s′|s, a), so this represents a model-free approach to learning.
goal state, so the expected reward has to be computed, instead of taking just the fixed This method can be applied even in very complex domains, but on the other hand the
value: Q-learning agent has no way of looking into the future, so it may have difficulty when
rewards are sparse and long action sequences must be taken to reach them.
Q(s, a) = P (s′|s, a)[R(s, a, s′) + γ max Q(s′, a′)]
X
′ a
s′ Q-learning with temporal difference updating converges to the result much slower than
the ADP algorithm since it does not enforce computing the full consistency of the
model, which it does not have. The consequence it a much faster computation.
In general an active Q-learning agent requires exploration, just as it is in the case with
the ADP method. That is why the algorithm uses the exploration function f and the
action frequency table N . With a simpler exploration function (eg. executing some
fraction of random moves) the N table might not be necessary.
SARSA — State-Action-Reward-State-Action Q-learning is a more flexible algorithm, as it permits the agent to learn the optimal
behavior even if she currently executes a policy different from the best learned
There exists a variant of the Q-learning algorithm with a temporal difference update patterns. On the other hand, SARSA is more realistic, since, for example, if the agent
method called SARSA (State-Action-Reward-State-Action): was unable to fully control her policy, then it would be better for her to learn the
patterns corresponding to what will in fact happen to her, than to learn the best
possible patterns.
Q(s, a) ← Q(s, a) + α(R(s) + γQ(s′, a′) − Q(s, a))
SARSA updates take into account five elements: s, a, r, s′, a′. While the Q-learning Both Q-learning and SARSA are capable of learning the optimal policy for the 4x3
algorithm adjustments are based on the best action selected for the state reached by example environment, albeit more slowly than the ADP (in terms of the number of
the action a, SARSA considers which action was in fact selected. Therefore, eg. for iterations). This is because the local updates they make do not enforce the full
a greedy agent executing only exploitation, these two approaches would give the same consistency of the Q function.
result.
For example, for problems with a very large number of states (eg. ≫ 1020 for games
such as chess or backgammon), is is hard to envision executing a sufficient number of
trials to visit each state frequently enough. It is necessary to use some generalization
method, which would permit to determine an effective policy based on a small part of
the state space explored.
Function approximation The main idea of this approach is not an approximation, using fewer coefficients, of
a function, which in fact requires many more of them, but the generalization. This is,
One of such methods is the function approximation, based on the representation of we want to generate a policy valid for all the states based on the analysis of a small
the function under examination (like U or Q) in some nontabular form, like a finite fraction of them.
formula. Similarly as was the case with the heuristic functions, some linear
combination of some features (also called the state attributes) can be used: For examples, in the experiments with the game backgammon, it was possible to train
a player to a level of play comparable to human players based on examining one in 1012
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s) states.
The reinforcement learning algorithm would learn the vector of coefficients Obviously, the success in reinforcement learning in such cases depends on the correct
θ = hθ1, θ2, ..., θni so that the evaluation function Ûθ would closely enough selection of the approximation function. If no combination of the selected features can
approximate the state utility function. And this provides a very compact representation give a good strategy for a game, then no method of learning the coefficients will lead
of the utility U (or Q) function for even very large state spaces. to one. On the other hand, selecting a very elaborate function, with a large number of
features and coefficients, increases the chance for a success, but at the expense of
a slower convergence and, consequently, the learning process.
This approach is called a function approximation because there is no proof that the
real evaluation function can be expressed by this kind of a formula. However, while it
seems doubtful that, for example, the optimal policy for chess can be expressed by
a function with just a few coefficients, it is entirely possible that a good level of
playing can be achieved this way.
Approximating direct utility estimation The rate of change of this error with respect to the parameter θi can be written as
∂Ej /∂θi, so in order to adjust this parameter toward decreasing the error, the proper
We can try to apply the idea of function generalization to the method of direct utility adjustment formula may be, where α is the usual learning coefficient:
estimation. Previously, we used this method directly on state utilities. Now we want to
convert to the parametric representation of the utilities: ∂Ej (s) ∂ Ûθ (s)
θi ← θi − α = θi + α(uj (s) − Ûθ (s))
∂θi ∂θi
Ûθ (s) = θ1f1(s) + θ2f2(s) + ... + θnfn(s)
Given a collection of trials, we obtain a set of sample values approximating Ûθ (s) and The above formula is known as the Widrow-Hoff or the delta rule.
a set of equations from the above formula, which can be solved for the optimal values
of hθ1, θ2, ..., θni in the sense of minimizing the squared error, for example, using the
linear regression.
Alternatively, we may try to use on-line learning, with some way of updating the
parameters based on the reinforcements obtained after each trial (or each step). We
want to express an update formula for each of the approximation parameters θi based
on the observed rewards. For example, if uj (s) is the reward-to-go for state s in j-th
trial, then the utility function approximation error can be computed as:
Assuming for an example θ = hθ0, θ1, θ2i = h0.5, 0.2, 0.1i we get the initial
approximation Ûθ (1, 1) = 0.8. If, after executing a trial, we computed eg.
uj (1, 1) = 0.72, then all the coefficients θ0, θ1, θ2 would be reduced by 0.08α, which in
turn would decrease the error for the state (1,1). Obviously, all the values of Ûθ (s)
would then change, which is the idea of generalization.
∂ Ûθ (s)
θi ← θi + α[R(s) + γ Ûθ (s′) − Ûθ (s)]
∂θi
∂ Q̂θ (s, a)
θi ← θi + α[R(s) + γ max Q̂θ (s′, a′) − Q̂θ (s, a)]
′ a ∂θi
For passive TD learning, these update rules can be shown to converge to the closest
possible approximation to the true function when the function approximator is linear in
its features. But with active learning and non-linear approximation functions in many
cases the parameters do not converge, even when good solutions exist.
In many cases it is difficult to come up sufficiently good features for an acceptable Stuart J. Russell, Peter Norvig: Artificial Intelligence: A Modern Approach (Fourth
linear approximation of U or Q. Sometimes it is easier to develop a complex, Edition), Prentice-Hall, 2020
non-linear function, which works better. [Link]
In this role, neural networks can be used, and particularly deep neural networks. They David Silver: Reinforcement Learning course (2015, University College London):
can be used on data such as raw images, with no feature extraction at all. The deep [Link]
neural network is able to discover the useful features. A reinforcement learning system
that uses a deep network as a function approximator is called a deep reinforcement
learning system.
The deep neural network is a function parametrized by θ, elements of which are all the
parameters of the network (all the weights and biases in all the layers of the network).
The feed-forward neural network updates these parameters in the process called
backpropagation, which is their update rule.