UNIT-3
LEARNING
Temporal-Difference prediction - Optimality of TD - Q
Learning - Off-Policy TD Control - R Learning - Actor Critic
Model - Unifying Monte Carlo and TD - Traces - Games
Q-learning:
Q-learning is a type of reinforcement learning in which a model learns iteratively and
improves over time by discovering which actions lead to the most reward.
Q-learning takes an off-policy approach to reinforcement learning. It aims to determine
the optimal action for the current state, and it learns the value of that optimal behavior
regardless of the policy the agent actually follows while exploring. The agent can
therefore develop its own set of rules or deviate from a prescribed policy, so a defined
policy is not needed.
The multiple components of Q-learning include the following:
Agents. The agent is the entity that acts and operates within an environment.
States. The state is a variable that identifies the agent's current position in
the environment.
Actions. The action is the agent's operation when it is in a specific state.
Rewards. A foundational concept within reinforcement learning is the
concept of providing either a positive or a negative response for the agent's
actions.
Episodes. An episode ends when the agent reaches a terminal state and can no
longer take a new action.
Q-values. The Q-value is the metric used to measure how good an action is in a
particular state.
Here are the two methods to determine the Q-value:
Temporal difference. The temporal difference formula updates the Q-value of
the current state and action by comparing it with the reward received plus the
estimated value of the state and action that follow.
Bellman's equation. Bellman's equation expresses the value of a given state
recursively, as the immediate reward plus the discounted value of the state
that follows. The state with the highest value is considered the optimal state.
What is a Q-table?
The Q-table is a table of rows and columns that stores, for each state in a specific
environment, the expected future reward (Q-value) of each available action. A Q-table
helps an agent understand what actions are likely to lead to positive outcomes in
different situations.
The table rows represent different situations the agent might encounter, and the columns
represent the actions it can take. As the agent interacts with the environment and
receives feedback in the form of rewards or penalties, the values in the Q-table are
updated to reflect what the model has learned.
The Q-table is directly related to the concept of the Q-function. The Q-function is a
mathematical function that takes the current state of the environment and the action
under consideration as inputs. Its output is the expected future reward for taking that
action in the specific state. The Q-table allows the agent to look up the expected future
reward for any given state-action pair to move toward an optimized state.
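The lookup described above can be sketched in a few lines of Python. This is only an illustration: the three states, the two actions and every number in the table are made up, and the helper names q_value and best_action are hypothetical.

```python
# Hypothetical 3-state, 2-action problem; all numbers are made up for illustration.
ACTIONS = ["left", "right"]

# Rows are states, columns are actions; every entry starts at 0.
q_table = [[0.0 for _ in ACTIONS] for _ in range(3)]

# Pretend learning has already produced these estimates.
q_table[0] = [0.1, 0.5]
q_table[1] = [0.2, 0.9]
q_table[2] = [0.0, 0.0]  # terminal state: nothing more to earn


def q_value(state, action_index):
    """The Q-function as a table lookup: Q(state, action)."""
    return q_table[state][action_index]


def best_action(state):
    """Index of the action with the highest Q-value in `state`."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


print(q_value(1, 1))            # expected future reward of "right" in state 1
print(ACTIONS[best_action(0)])  # greedy choice in state 0
```

A plain list of lists is enough for a small problem like this one. Larger state spaces usually call for a dictionary or an array instead.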
The procedural form of the algorithm centers on the Q-value update:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
The parameters used in the Q-value update process are:
α - the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values
are never updated, hence nothing is learned. Setting a high value such as 0.9 means
that learning can occur quickly.
γ - the discount factor, also set between 0 and 1. This models the fact that future
rewards are worth less than immediate rewards. Mathematically, the discount
factor needs to be less than 1 for the algorithm to converge when episodes can run
indefinitely.
max_{a'} Q(s', a') - the maximum reward that is attainable in the state following the
current one, i.e. the estimated reward for taking the optimal action thereafter.
This procedural approach can be translated into plain English steps as follows:
1. Initialize the Q-values table, Q(s, a).
2. Observe the current state, s.
3. Choose an action, a, for that state based on one of the action selection policies
(ε-soft, ε-greedy or softmax).
4. Take the action, and observe the reward, r, as well as the new state, s'.
5. Update the Q-value for the state using the observed reward and the maximum
reward possible for the next state. The updating is done according to the formula
and parameters described above.
6. Set the state to the new state, and repeat the process until a terminal state is
reached.
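The six steps above can be sketched as a short Python program. Everything about the environment is an assumption made for illustration: a five-state corridor in which the agent starts at state 0 and receives a reward of 1 on reaching state 4. The parameter values and the function names step, epsilon_greedy and q_learning are likewise hypothetical, and ε-greedy is just one of the selection policies the text lists.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical corridor)
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1


def step(state, action):
    """Toy environment: returns (reward, next_state, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return (1.0 if done else 0.0), next_state, done


def epsilon_greedy(q, state, rng):
    """Step 3: explore with probability EPSILON, otherwise act greedily."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties at random so the untrained agent does not always go left.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]               # step 1
    for _ in range(episodes):
        state, done = 0, False                              # step 2
        while not done:
            action = epsilon_greedy(q, state, rng)          # step 3
            reward, next_state, done = step(state, action)  # step 4
            # Step 5: a terminal state has no future value to bootstrap from.
            target = reward if done else reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state                              # step 6
    return q


q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # greedy action learned for each non-terminal state
```

In this corridor the only reward lies to the right, so the learned greedy policy should choose action 1 in every non-terminal state.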
Off-Policy TD Control
On-policy and off-policy learning fall under the category of model-free reinforcement
learning algorithms, meaning that we do not have access to the transition probability
distribution.
This is opposed to model-based methods such as Monte-Carlo Tree Search (MCTS).
On-policy algorithms attempt to improve upon the current behavior policy that is
used to make decisions and therefore these algorithms learn the value of the policy
carried out by the agent.
Off-policy algorithms learn the value of the optimal policy, Q*, and can improve upon
a policy that is different from the behavior policy.
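One way to see the difference is to compare the targets that the two kinds of algorithm bootstrap from. The sketch below is illustrative only: the on-policy rule shown is the one commonly known as SARSA, which this section does not name, and the Q-values and the explored action are invented.

```python
GAMMA = 0.9


def on_policy_target(reward, q_next, next_action):
    """On-policy (SARSA-style): use the action the behavior policy takes next."""
    return reward + GAMMA * q_next[next_action]


def off_policy_target(reward, q_next):
    """Off-policy (Q-learning): use the greedy action, whatever is taken next."""
    return reward + GAMMA * max(q_next)


q_next = [0.2, 1.0]   # hypothetical Q-values of the next state
reward = 0.0
explored_action = 0   # suppose the behavior policy explores and picks action 0

print(on_policy_target(reward, q_next, explored_action))
print(off_policy_target(reward, q_next))
```

The on-policy target is pulled down by the exploratory move, while the off-policy target ignores it and evaluates the greedy choice.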
Temporal Difference Learning (TD Learning)
One of the problems with the environment is that rewards usually are not immediately
observable. For example, in games such as tic-tac-toe, we only know the reward on the
final move (terminal state). All other moves have an immediate reward of 0.
TD learning is an unsupervised technique to predict a variable's expected value in a
sequence of states. TD uses a mathematical trick to replace complex reasoning about the
future with a simple learning procedure that can produce the same results. Instead of
calculating the total future reward, TD tries to predict the combination of immediate
reward and its own reward prediction at the next moment in time.
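Mathematically, the key quantity is the discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
where the return at time t is the combination of discounted rewards in the future. It
implies that future rewards are valued less. The TD error is the difference between the
ultimate correct reward (V*_t) and our current prediction (V_t):
$$E_t = V^*_t - V_t = r_{t+1} + \gamma V_{t+1} - V_t$$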
Parameters
Alpha (α): the learning rate. This parameter shows how much we should adjust our
estimates based on the error. The learning rate is between 0 and 1. A large learning rate
adjusts aggressively and might lead to fluctuating training results that do not converge.
A small learning rate adjusts slowly, which will take more time to converge.
Gamma (γ): the discount rate. This parameter shows how much we value future rewards.
The discount rate is between 0 and 1. The bigger the discount rate, the more we value
future rewards.
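The effect of γ can be checked with a little arithmetic. The sketch below is illustrative: the function name discounted_return and the reward sequence are made up, with the reward arriving only on the final move as in the tic-tac-toe example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k]: later rewards count for less."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total


rewards = [0.0, 0.0, 0.0, 1.0]   # hypothetical episode: reward only on the final move
for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With γ = 0 the final reward is invisible from the first move. With γ = 1 it counts in full, and values in between shrink it by a factor of γ for every step of delay.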
TD Prediction
As with Monte Carlo methods, prediction refers to the problem of estimating the values
of states. The value of a state indicates how good that state is for an agent in the given
environment: the higher the value, the better it is to be in that state.
Monte Carlo and Temporal Difference learning are similar in that both use experience to
evaluate a given policy. However, Monte Carlo methods must wait until the episode ends,
when the return following the visit is known, before updating the value of the state. TD
methods wait only until the next time step: at time t+1 they immediately form a target
from the observed reward and the current estimate of the next state's value, and make a
useful update.
Updating the state value just after one time step is called one-step TD or TD(0), which is a
special case of the TD(lambda) and n-step TD methods which are beyond the scope of our
discussion, however, the principles we explore here can be extended to those methods
without much complexity.
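A minimal sketch of one-step TD prediction follows. The task is an assumption chosen for illustration: a random walk over five states in a row, where the policy moves left or right with equal probability, stepping off the right end pays 1 and stepping off the left end pays 0. The step size, episode count and function name are also hypothetical.

```python
import random

N_STATES = 5            # non-terminal states 0..4 in a row (hypothetical random walk)
ALPHA, GAMMA = 0.05, 1.0


def td0_prediction(episodes=2000, seed=0):
    """Estimate V(s) for a policy that moves left or right with equal probability."""
    rng = random.Random(seed)
    v = [0.5] * N_STATES                    # initial guesses
    for _ in range(episodes):
        state = N_STATES // 2               # every episode starts in the middle
        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:              # stepped off the left end: reward 0
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= N_STATES:    # stepped off the right end: reward 1
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, v[next_state], False
            # One-step TD update, applied immediately after the transition.
            v[state] += ALPHA * (reward + GAMMA * next_value - v[state])
            if done:
                break
            state = next_state
    return v


values = td0_prediction()
print([round(x, 2) for x in values])
```

For this walk the true state values are 1/6, 2/6, 3/6, 4/6 and 5/6 from left to right. The estimates should land near them, with some residual noise because the step size is constant.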
R Learning:
R-learning is a method for optimizing average reward, weighing both far-future and
near-term reward the same.
While optimising the average reward may seem more natural, it also has some
problems, such as making the agent “lazy”.
As long as the agent eventually reaches the goal, it will get maximum average reward,
since only “the tail” matters for the average.
There have been some attempts to work around this, but Q-learning remains the far
more popular algorithm.
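For reference, the update rules usually given for R-learning can be written as follows. This is a sketch of the commonly stated form, not something taken from this section: ρ (the running estimate of the average reward per step) and β (its step size) are symbols introduced here, while α, r, s, a and s' are as in the Q-learning update.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r - \rho + \max_{a'} Q(s', a') - Q(s, a) \right]

% applied only when the action taken was greedy, i.e. Q(s, a) = \max_{a} Q(s, a)
\rho \leftarrow \rho + \beta \left[ r - \rho + \max_{a'} Q(s', a') - \max_{a} Q(s, a) \right]
```

There is no discount factor: each reward is measured relative to the average ρ, which is how far-future and near-term reward end up weighted the same.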
Temporal Difference methods are said to combine the sampling of Monte Carlo with the
bootstrapping of DP (dynamic programming). In Monte Carlo methods the target is an
estimate because the actual expected value is not known, so a sample return from that
particular episode is used instead. In DP the target is an estimate because the true value
of the next state is not known, so the current estimate is used instead. In TD the target is
an estimate for both reasons: it samples the expected value and it uses the current
estimate instead of the true state value.
The TD(0) update equation is
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ r_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
The quantity in brackets is a measure of error: the difference between the estimated value
of S_t and the better estimate r_{t+1} + γV(S_{t+1}) available at the next time step. This
quantity is called the TD error, and it appears throughout reinforcement learning.