MLnotes Module 5
MLnotes Module 5
MODULE 5
Clustering Algorithms
Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning involves exploring
the given dataset. Cluster analysis is a technique of partitioning a collection of unlabelled objects that have
many attributes into meaningful disjoint groups or clusters. This is done using a trial-and-error approach as
there are no supervisors available as in classification. The characteristic of clustering is that the objects in
the clusters or groups are similar to each other within the clusters while differ from the objects in other
clusters significantly.
The input for cluster analysis is examples or samples. These are known as objects, data points or data
instances. All these terms are same and used interchangeably in this chapter. All the samples or objects with
no labels associated with them are called unlabelled. The output is the set of clusters (or groups) of similar
data if it exists in the input. For example, the following Figure 13.1(a) shows data points or samples with
two features shown in different shaded samples and Figure 13.1(b) shows the manually drawn ellipse to
indicate the clusters formed.
Visual identification of clusters in this case is easy as the examples have only two features. But, when
examples have more features, say 100, then clustering cannot be done manually and automatic clustering
algorithms are required. Also, automating the clustering process is desirable as these tasks are considered
difficult by humans and almost impossible. All clusters are represented by centroids. For example, if the
input examples or data is (3, 3), (2, 6) and (7, 9), then the centroid is given as
The clusters should not overlap and every cluster should represent only one class. Therefore, clustering
algorithms use trial and error method to form clusters that can be converted to labels. Thus, the important
differences between classification and clustering are given in Table 13.1.
Department of CSE,CEC 1
Machine Learning (BCS02)
Applications of Clustering
A huge collection of data with higher dimensions (i.e., features or attributes) can pose a problem for
clustering algorithms. With the arrival of the internet, billions of data are available for clustering algorithms.
This is a difficult task, as scaling is always an issue with clustering algorithms. Scaling is an issue where
some algorithms work with lower dimension data but do not perform well for higher dimension data. Also,
units of data can post a problem, like some weights in kg and some in pounds can pose a problem in
clustering. Designing a proximity measure is also a big challenge.
Department of CSE,CEC 2
Machine Learning (BCS02)
Clustering algorithms need a measure to find the similarity or dissimilarity among the objects to group them.
Similarity and dissimilarity are collectively known as proximity measures. Often, the distance measures are
used to find similarity between two objects, say i and j.
Distance measures are known as dissimilarity measures, as these indicate how one object is different from
another. Measures like cosine similarity indicate the similarity among objects. Distance measures and
similarity measures are two sides of the same coin, as more distance indicates more similarity and vice versa.
Distance between two objects, say i and j, is denoted by the symbol Dij.
If all these conditions are satisfied, then the distance measure is called a metric.
Based on the data types of the attributes of an object, distance measures vary. The concept of data types is
discussed in Module 2. It can be recollected that the data types are divided into categorical and quantitative
variables. Quantitative variables are numbers and are of two types – nominal and ordinal. For example,
gender is a nominal variable as gender can be enumerated as Gender = {male, female}. Ordinal variables
look like nominal variables but have an inherent order present in the enumeration. For example, temperature
is a nominal variable as temperature can be enumerated as Temperature = {low, medium, high}. One can
observe the inherent order present, that is, medium > low and low < medium. Quantitative variables are
real or Integer numbers or binary data. In binary data, the attributes of the object can take a Boolean value.
Objects whose attributes take binary data are called binary objects.
Quantitative Variables
Some of the qualitative variables are discussed below.
1. Euclidean Distance- It is one of the most important and common distance measures. It is also
called as L₂ norm. It can be defined as the square root of squared differences between the
coordinates of a pair of objects.
The Euclidean distance between objects xi and xj with k features is given as follows:
The advantage of Euclidean distance is that the distance does not change with the addition of new
objects. But the disadvantage is that if the units change, the resulting Euclidean or squared Euclidean
changes drastically. Another disadvantage is that as the Euclidean distance involves a square root and
a square, the computational complexity is high for implementing the distance for millions or billions
of operations involved.
Department of CSE,CEC 3
Machine Learning (BCS02)
2. City Block Distance- City block distance is known as Manhattan distance. This is also known as
boxcar, absolute value distance, Manhattan distance, Taxicab or L₁ norm. The formula for finding
the distance is given as follows:
Example 13.1:
Suppose, if the coordinates of the objects are (0, 3) and (5, 8), then what is the Chebyshev distance?
Solution:
The Euclidean distance using Eq. (13.1) is given as follows:
4. Minkowski Distance-In general, all the above distance measures can be generalized as:
This is called Minkowski distance. Here, r is a parameter. When the value of r is 1, the distance
measure is called city block distance. When the value of r is 2, the distance measure is called
Euclidean distance. When r is ∞, then this is Chebyshev distance.
Department of CSE,CEC 4
Machine Learning (BCS02)
Binary Attributes
Binary attributes have only two values. Distance measures discussed above cannot be applied to find distance
between objects that have binary attributes. For finding the distance among objects with binary objects, the
contingency Table 13.3 can be used. Let x and y be the objects consisting of N-binary objects. Then, the
contingency table can be constructed by counting the number of matching of transitions, 0-0, 0-1, 1-0 and 1-
1.
1. Simple Matching Coefficient (SMC)-SMC is a simple distance measure and is defined as the ratio
of number of matching attributes and the number of attributes. The formula is given as:
2. Jaccard Coefficient- Jaccard coefficient is another useful measure and is given as follows:
Example 13.2:
If the given vectors are x=(1,0,0) and y=(1,1,1), then find the SMC and Jaccard coefficient?
Given vectors:
• x= (1,0,0)
• y= (1,1,1)
We need to find the values of a, b, c, and d using the Contingency Table (13.3) for binary attributes.
The counts mean:
• a: both x=0 and y=0y
• b: x=0 , y = 1
• c: x=1, y = 0
• d: both x=1 and y=1
Step-by-step Matching
Let's compare each position of x and y :
1. First position:
x=1, y = 1 → d increases by 1
2. Second position:
x = 0, y=1→ b increases by 1
Department of CSE,CEC 5
Machine Learning (BCS02)
3. Third position:
x = 0, y = 1 → b increases by 1 again
Final Counts
• a=0(no pair where both are 0)
• b=2(positions 2 and 3)
• c=0(no pair where x=1 and y=0)
• d=1(position 1)
3. Hamming Distance- Hamming distance is another useful measure that can be used for knowing the
sequence of characters or binary values. It indicates the number of positions at which the characters
or binary bits are different.
For example, the Hamming distance between x=(1 0 1) and y=(1 1 0) is 2 as x and y differ in two
positions. The distance between two words, say wood and hood is 1, as they differ in only one
character. Sometimes, more complex distance measures like edit distance can also be used.
Categorical Variables
In many cases, categorical values are used. It is just a code or symbol to represent the values. For example,
for the attribute Gender, a code 1 can be given to female and 0 can be given to male.
To calculate the distance between two objects represented by variables, we need to find only whether they
are equal or not. This is given as:
Ordinal Variables
Ordinal variables are like categorical values but with an inherent order. For example, designation is an ordinal
variable. If job designation is 1 or 2 or 3, it means code 1 is higher than 2 and code 2 is higher than 3. It is
ranked as 1 > 2 > 3.
Department of CSE,CEC 6
Machine Learning (BCS02)
Let us assume the designations of office employees are clerk, supervisor, manager and general manager.
These can be designated as numbers as clerk = 1, supervisor = 2, manager = 3 and general manager = 4.
Then, the distance between employee X who is a clerk and Y who is a manager can be obtained as:
Here, position (X) and position (Y) indicate the designated numerical value. Thus, the distance between X
(Clerk = 1) and Y (Manager = 3) using Eq. (13.8) is given as:
For text classification, vectors are normally used. Cosine similarity is a metric used to measure how similar
the documents are irrespective of their size. Cosine similarity measures the cosine of the angle between two
vectors projected in a multi-dimensional space. The similarity function for vector objects can be defined as:
The numerator is the dot product of the vectors A and B. The denominator is the product of the norm of
vectors A and B.
Example 13.3:
If the given vectors are A={1,1,0} and B={0,1,1}, then what is the cosine similarity?
Solution:
The dot product of the vector is 1×0+1×1+0×1= 1.
The norm of the vectors A and B is
Department of CSE,CEC 7
Machine Learning (BCS02)
Hierarchical methods produce a nested partition of objects with hierarchical relationships among objects.
Often, the hierarchy relationship is shown in the form of a dendrogram.
Hierarchical methods include categories, agglomerative methods and divisive methods.
• In agglomerative methods, initially all individual samples are considered as a cluster, that is, a
cluster with a single element. Then, they are merged and the process is continued to get a single
cluster.
• Divisive methods use another kind of philosophy, where a single cluster of all samples of the dataset
taken initially is chosen and then partitioned. This partition process is continued until the cluster is
split into smaller clusters.
Agglomerative methods merge clusters to reduce the number of clusters. This is repeated each time while
merging two closest clusters to get a single cluster. The procedure of agglomerative clustering is given as
follows:
1. Place each N sample or data instance into a separate cluster. So, initially N clusters are available.
2. Repeat the following steps until a single cluster is formed:
(a) Determine two most similar clusters.
(b) Merge the two clusters into a single cluster reducing the number of clusters as N−1.
3. Choose resultant clusters of step 2 as result.
All the clusters that are produced by hierarchical algorithms have equal diameters. The main disadvantage
of this approach is that once the cluster is formed, it is an irreversible decision.
Hierarchical clustering algorithms produce nested clusters, which can be visualized as a hierarchical tree or
dendrogram. The idea behind this approach is proximity among clusters. In single linkage algorithm, the
smallest distance (x,y), where x is from one cluster and y is from another cluster, is the distance between all
possible pairs of the two groups or clusters (or simply the smallest distance of two points where points are
in different clusters) and is used for merging the clusters. This corresponds to finding of minimum spanning
tree (MST) of a graph.
The distance measures between individual samples or data points is already demonstrated in the previous
section 13.2. To understand the single linkage algorithm, go through the following numerical problem that
involves finding the distance between clusters.
Example 13.4
Department of CSE,CEC 8
Machine Learning (BCS02)
Solution:
Step 1: Compute Euclidean distances (Proximity Matrix)
The similarity table among the variables is computed. Euclidean distance is calculated and displayed in the
following matrix:
Step-by-Step: Calculating the Proximity Matrix
• p=(x1,y1)
• q=(x2,y2)
•
2. Distance Calculations
Department of CSE,CEC 9
Machine Learning (BCS02)
Objects 0 1 2 3 4
0 – 4.123 7.211 17.804 27.295
1 4.123 – 3.606 14.142 23.324
2 7.211 3.606 – 10.630 20.124
3 17.804 14.142 10.630 – 10.198
4 27.295 23.324 20.124 10.198 –
The minimum distance is 3.606 between points 1 and 2, so we cluster {1,2} first.
Department of CSE,CEC 10
Machine Learning (BCS02)
Department of CSE,CEC 11
Machine Learning (BCS02)
Dendrogram for the above clustering is shown in Figure 13.3 for table 13.4.
Department of CSE,CEC 12
Machine Learning (BCS02)
every point is assigned a weight that decreases as the distance from the kernel center increases. The algorithm
is given below.
Here, K is the number of points and Sk is the data points where the distance from data points xi and centroid
of the kernel x is within the radius of the sphere. Then, the centroid is updated as x=x+vs
Step 5: Repeat the steps 3–4 for convergence. Once convergence is achieved, no further points can be
accommodated.
Advantages
1. No model assumptions
2. Suitable for all non-convex shapes
3. Only one parameter of the window, that is, bandwidth is required
4. Robust to noise
5. No issues of local minima or premature termination
Disadvantages
1. Selecting the bandwidth is a challenging task. If it is larger, then many clusters are missed.
If it is small, then many points are missed and convergence occurs as the problem.
2. The number of clusters cannot be specified and user has no control over this parameter.
‘k-means’ algorithm is a straightforward iterative partitional algorithm. Here, k stands for the user specified
requested clusters as users are not aware of the clusters that are present in the dataset. The k-means algorithm
assumes that the clusters do not overlap. Therefore, a sample or data point can belong to only one cluster in
the end. Also, this algorithm can detect clusters of shapes like circular or spherical.
Initially, the algorithm needs to be initialized. The algorithm can select k data points randomly or use the
prior knowledge of the data. In most cases, in kk-means algorithm setup, prior knowledge is absent. The
composition of the cluster is based on the initial condition, therefore, initialization is an important task. The
sample or data points need to be normalized for better performance.
The core process of the k-mean algorithm is assigning a sample to a cluster, that is, assigning each sample
or data point to the kk cluster centres based on its distance and the centroid of the clusters. This distance
should be minimum. As a new sample is added, new computation of the mean of the points for that cluster
to which a sample is assigned is required. Therefore, this iterative process is continued until no change of
instances to clusters is noticed. This algorithm then terminates and the termination is guaranteed.
Department of CSE,CEC 13
Machine Learning (BCS02)
k-means can also be viewed as greedy algorithm as it involves partitioning nn samples to kk clusters to
minimize Sum of Squared Error (SSE). SSE is a metric that is a measure of error that gives the sum of the
squared Euclidean distances of each data to its closest centroid. It is given as:
Here, ci is the centroid of the ith cluster, x is the sample or data point and dist is the Euclidean distance. The
aim of the k-means algorithm is to minimize SSE.
Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to initialization process as change of initial points leads to different clusters.
2. If the samples are large, then the algorithm takes a lot of time.
Complexity
The complexity of k-means algorithm is dependent on the parameters like n, the number of samples, k, the
number of clusters, Θ(nkld). l is the number of iterations and d is the number of attributes. The complexity
of k-means algorithm is O(n2).
Example 13.5: Consider the following set of data given in Table 13.9. Cluster it using k-means algorithm
with the initial value of objects 2 and 5 with the coordinate values (4, 6) and (12, 4) as initial seeds.
Table 13.9: Sample Data
Objects X-coordinate Y-coordinate
1 2 4
2 4 6
3 6 8
4 10 4
5 12 4
Department of CSE,CEC 14
Machine Learning (BCS02)
Solution: As per the problem, choose the objects 2 and 5 with the coordinate values. Hereafter, the objects'
id is not important. The samples or data points (4, 6) and (12, 4) are started as two clusters as shown in Table
13.10.
Initially, centroid and data points are same as only one sample is involved.
Table 13.10: Initial Cluster Table
Cluster 1 Cluster 2
(4, 6) (12, 4)
Centroid 1 (4, 6) Centroid 2 (12, 4)
Iteration 1: Compare all the data points or samples with the centroid and assign to the nearest sample.
For the object 1 (2, 4), the Euclidean distance between it and the centroid is given as:
Dist (1, centroid 1) = √((2–4)² + (4–6)²) = √8 =2.9
Dist (1, centroid 2) = √((2–12)² + (4–4)²) = √100 = 10
Object 1 is closer to the centroid of cluster 1 and hence assign it to cluster 1. This is shown in Table
13.11.
For the object 3 (6, 8), the Euclidean distance between it and the centroid points is given as:
Dist (3, centroid 1) = √((6–4)² + (8–6)²) = √8 =2.9
Dist (3, centroid 2) = √((6–12)² + (8–4)²) = √52 =7.21
Object 3 is closer to the centroid of cluster 1 and hence remains in the same cluster 1.
Proceed with the next point object 4(10, 4) and again compare it with the centroids in Table 13.10.
Object 4 is closer to the centroid of cluster 2 and hence assign it to the cluster 2.
Recompute the new centroids of cluster 1 and cluster 2. They are (4, 6) and (11, 4), respectively.
Table 13.11: Cluster Table After Iteration 1
Cluster 1 Cluster 2
(4, 6), (2, 4), (6, 8) (10, 4), (12, 4)
Centroid 1 (4, 6) Centroid 2 (11, 4)
The second iteration is started again with the Table 13.11.
Obviously, the point (4, 6) remains in cluster 1, as the distance of it with itself is 0. The remaining
objects can be checked.
Take the sample object 1 (2, 4) and compare with the centroid of the clusters in Table 13.12.
Object 1 is closer to centroid of cluster 1 and hence remains in the same cluster.
Department of CSE,CEC 15
Machine Learning (BCS02)
Take the sample object 3 (6, 8) and compare with the centroid values of clusters 1 (4, 6) and cluster 2 (11,
4) of the Table 13.12.
Object 3 is closer to centroid of cluster 1 and hence remains in the same cluster.
Take the sample object 4 (10, 4) and compare with the centroid values of clusters 1 (4, 6) and cluster 2
(11, 4) of the Table 13.12:
Object 4 is closer to centroid of cluster 2 and hence remains in the same cluster.
Object 5 is closer to centroid of cluster 2 and hence remains in the same cluster.
There is no change in the cluster Table 13.12. It is exactly the same; therefore, the k-means algorithm
terminates with two clusters with data points as shown in the Table 13.12.
Density-based spatial clustering of applications with noise (DBSCAN) is one of the density-based
algorithms. The density of a region represents the region where many points above the specified threshold
are present. In a density-based approach, the clusters are regarded as dense regions of objects that are
separated by regions of low density such as noise. This is similar to a human’s intuitive way of observing
clusters.
The concept of density and connectivity is based on the local distance of neighbours. The functioning of this
algorithm is based on two parameters: the size of the neighbourhood (ε) and the minimum number of points
(m).
1. Core point – A point is called a core point if it has more than a specified number of points (m) within
its ε-neighbourhood.
2. Border point – A point is called a border point if it has fewer than m points but is a neighbour of a
core point.
3. Noise point – A point that is neither a core point nor a border point.
The main idea is that every data point or sample should have at least a minimum number of neighbors in a
neighborhood. The neighborhood of radius ε should have at least m points. The notion of density
connectedness determines the quality of the algorithm.
The following connectedness measures are used for this algorithm:
Department of CSE,CEC 16
Machine Learning (BCS02)
Subspace Clustering
Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many attributes.
Some data like gene data may have millions of attributes. Every attribute is called a dimension. But all the
attributes are not needed, as in many applications one may not require all the attributes. For example, an
employee’s address may not be required for profiling his diseases. Age may be required in that case. So, one
can conclude that only a subset of features is required. For example, one may be interested in grouping gene
data with similar characteristics or organs that have similar functions.
Finding subspaces is difficult. For example, N dimensions may have 2^(N-1) subspaces. Exploring all the
subspaces is a difficult task. Here, only the CLIQUE algorithms are useful for exploring the subspaces.
CLIQUE (Clustering in Quest) is a grid-based method for finding clustering in subspaces. CLIQUE uses a
multiresolution grid data structure.
Department of CSE,CEC 17
Machine Learning (BCS02)
algorithm finds the number of cells, number of points, etc., and then combines the dense cells. For that, the
algorithm uses the contiguous intervals and a set of dense cells.
Monotonicity Property
CLIQUE uses the anti-monotonicity property or apriori property of the famous apriori algorithm. It means
that all the subsets of a frequent item should be frequent. Similarly, if the subset is infrequent, then all its
supersets are infrequent as well. Based on the apriori property, one can conclude that a k-dimensional cell
has r points if and only if every (k - 1) dimensional projection of this cell has at least r points. So like
association rule mining that uses the apriori rule, the candidate dense cells are generated for higher
dimensions. The algorithm works in two stages as shown below.
Advantages of CLIQUE
1. Insensitive to input order of objects
2. No assumptions of underlying data distributions
3. Finds subspace of higher dimensions such that high-density clusters exist in those subspaces
Disadvantage
The disadvantage of CLIQUE is that tuning of grid parameters, such as grid size, and finding optimal
threshold for finding whether the cell is dense or not is a challenge.
Department of CSE,CEC 18
Machine Learning (BCS02)
Reinforcement Learning
14.1 Overview of Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that mimics human learning through
interaction with the environment using senses like sight and hearing. The brain processes inputs and takes
actions, either voluntarily or involuntarily.
In real-world scenarios, learning happens through interaction. For example, when a child burns their hand
on fire, they learn not to repeat the action. Encouragement and scolding act as positive and negative
reinforcements, respectively, helping the child to repeat good behaviours and avoid bad ones. These
concepts are illustrated in Figure Examples of Positive and Negative Rewards.
In machines, this interaction is simulated through actions and corresponding positive or negative rewards.
RL uses a mathematical framework and is classified into:
• Positive Reinforcement Learning: Encourages behaviour via rewards (e.g., +10 points).
• Negative Reinforcement Learning: Discourages bad behaviour using penalties (e.g., -10 points).
RL is a form of semi-supervised learning used in sequential decision-making tasks.
A Grid Game shows a robot navigating a grid from starting point E to goal G, avoiding danger and
obstacles, using directions like left, right, up, and down.
Department of CSE,CEC 19
Machine Learning (BCS02)
RL is useful in uncertain or partially observable environments unlike object detection, which suits supervised
learning.
Department of CSE,CEC 20
Machine Learning (BCS02)
• Agent
• Environment
• State
• Action
• Reward
Department of CSE,CEC 21
Machine Learning (BCS02)
Reinforcement Problems
Reinforcement learning problems are generally categorized into two types:
1. Learning Problems:
Here, the environment is unknown to the agent, and the agent must learn by interacting with it
through trial and error. The goal is to improve the decision-making policy over time based on the
outcomes of actions taken.
2. Planning Problems:
In these scenarios, the environment is known beforehand. The agent uses this knowledge to plan its
actions in advance, optimizing the policy without needing to learn from interaction.
Department of CSE,CEC 22
Machine Learning (BCS02)
Episodes
‘Episodes’ is the number of steps necessary to reach the goal state from start state. For example, in Figure
above, the aim is to find optimal path between A and G. There are two types of episodes.
One is called episodic and another is called continuous. In episodic tasks, there is a well-defined starting
state and end state. End state is called as terminal state or goal state. An episode is one which consists of
states, actions, and rewards. For example, all the paths from A to G, such as...
For example, a successful path (episode) from state A to goal state G could be:
A→B→D→F→G
or
A → B → E → G.
Policies
A policy defines the agent’s behavior. It is a mapping from states to actions. Policies can be:
• Deterministic: The same action is always chosen for a given state.
Mathematically, it is denoted as:
a = π(s) (Equation 14.1)
• Stochastic: There is a probability distribution over actions for a given state.
Represented as:
π(a | s) = Pr[aₜ = a | sₜ = s] (Equation 14.2)
The goal of RL is to find an optimal policy that maximizes cumulative reward over time.
Rewards
Rewards are signals returned by the environment in response to the agent’s action. They guide the learning
process by indicating how good or bad an action was.
Immediate Reward:
This is given instantly after an action is taken. The total return from an episode is the sum of all immediate
rewards collected from start to end.
Expressed as:
Gₜ = rₜ₊₁ + rₜ₊₂ + rₜ₊₃ + ... + r_T (Equation 14.3)
Where Gₜ is the total return, and r_T is the reward at the terminal state.
Department of CSE,CEC 23
Machine Learning (BCS02)
Long-term Reward:
Instead of focusing only on immediate rewards, long-term rewards consider the sum of all future rewards.
This is particularly useful in continuous tasks.
To calculate long-term returns, we introduce a discount factor (γ), which gives less importance to future
rewards.
Mathematically:
Here:
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1).
• A higher γ (close to 1) gives more weight to future rewards.
• A lower γ (close to 0) prioritizes immediate rewards only.
Common choices for γ are 0.8 or 0.9, striking a balance between immediate and future rewards.
Example 14.1
Suppose we receive a reward of +1 at the fifth step in a process. If the discount factor, denoted by γ, is 0.7,
we need to determine the discounted reward or utility. We assume that rewards from all prior steps (1st to
4th) are zero.
According to the discounting formula (Eq. 14.4), the reward at the fifth step is computed as:
(0.7)5×1=0.16807×1 = 0.16807
Thus, the discounted reward is 0.16807.
Broadly, reinforcement algorithms are categorized into two types: model-based and model-free methods,
depending on whether the model of the environment is known. Several real-world problems can be
approached using a Markov Decision Process (MDP), which we'll now explore.
In this example, 80% of students from university A move to university B, and only 20% remain at A.
Similarly, 60% of students from university B switch to A, and 40% stay at B.
Department of CSE,CEC 24
Machine Learning (BCS02)
From this, the transition matrix P can be constructed, which contains the probabilities of moving between
states:
Each row of the matrix represents a probability vector, meaning the sum of the elements in each row is 1.
If the vector u represents the initial state of the chain, the state after n steps is given by:
un+1=un×P 14.5
This allows us to determine how the system evolves over time by multiplying the current state vector with
the transition matrix.
Example 14.2
Assume initially 60% of students are at university A and 40% at university B, i.e., u0. Using the same
transition matrix P from Figure above, we can predict how these proportions change after one year and two
years.
After one year:
So, after one year, university A holds 28% and B holds 72%.
After two years:
If rewards are introduced into a Markov chain, it becomes a Markov Decision Process (MDP). In this case,
each edge in the graph is associated with both a probability and a reward.
The components of an MDP are:
1. A set of states
2. A set of actions
3. A reward function R
4. A policy π\pi
5. A value V
An MDP models the environment as a graph, with states as nodes and transitions as edges. Each edge has a
static transition probability.
The Markov Assumption is crucial for reinforcement learning. It states that the probability of reaching a
new state st+1 and obtaining a reward rt depends only on the current state st and action at, not on any earlier
states or rewards:
Department of CSE,CEC 25
Machine Learning (BCS02)
Once the MDP is modelled, the system is trained and tested by iterating actions and rewards to optimize
performance.
And the best action is the one that maximizes the expected reward:
Example 14.3
Assume a slot machine is used five times and returns rewards 1, 2, 0, 7, and 9. The quality of the chosen
action is calculated as:
To avoid storing all rewards, a recursive formula is used to update the average reward:
Department of CSE,CEC 26
Machine Learning (BCS02)
Greedy Method
When there is no prior knowledge, choosing the best current action is known as the Greedy Method, which
relies solely on exploitation. In a multi-arm bandit scenario, all levers are initially tried, and the one with the
highest reward is then exploited repeatedly.
For instance, if the fifth lever provides the highest reward, it is repeatedly used, even though another lever
might have offered better rewards. Thus, the greedy approach lacks exploration and may not always find the
truly optimal action.
Here is the descriptive content extracted from the provided pages, with equations and figure numbers
preserved:
ε – Greedy Method
The ε-greedy method combines exploration and exploitation strategies in reinforcement learning. The
parameter ε ranges from 0 to 1, where lower values Favor exploitation and higher values Favor exploration.
Algorithm 14.1: ε – Greedy
• Step 1: Generate a random number pp in the range [0, 1].
• Step 2: If p≥εp the action with the highest reward is selected (exploitation).
• Step 3: If p<εp < , an action is chosen randomly (exploration).
• Step 4: Update the value of the action and repeat for all actions.
With probability ε, a random action is selected. Otherwise, the action with the highest reward (greedy choice)
is taken. This creates a balance between exploration and exploitation. For instance, if ε=0.2, then 20% of the
time random actions are taken, and 80% of the time the best-known actions are selected.
Department of CSE,CEC 27
Machine Learning (BCS02)
Actor-Critic Methods
These are hybrid approaches that combine value-based and policy-based methods.
Model-based Approach
In this approach, a model of the environment is constructed. One way to represent the environment is through
a Markov Decision Process (MDP). Once the model is available, decision-making becomes more systematic.
Model-free Methods
In scenarios where a model of the environment is unavailable, simulation methods like Temporal Difference
(TD) and Monte Carlo techniques are employed to estimate values and policies.
In reinforcement learning, the focus is to take action a to transit from the current state to the end state. The
aim is to get rewards for the actions. They can be positive or negative.
The objective is to maximize the rewards by choosing the optimal policy.
Max E(rt∣r,st) for all possible values of ‘s’ at time ‘t’.
There are multiple courses of actions for a given state. The agent can prefer any course of action. So, the
behaviour of the agent needs to be understood. The algorithm to do that is called a policy. A policy is the
distribution of all actions with the probability of choosing these actions. Different actions can have different
rewards. It is necessary to find the total reward. This is possible only if the reward is quantified. It is done
using value functions.
Value Notation
The aim in reinforcement learning is to choose actions that maximize reward. This involves selecting the
optimal policy π\pi, which maps actions to their probability distributions. The policy maximizes the
expected cumulative reward:
Max E(rt∣r,st)
Department of CSE,CEC 28
Machine Learning (BCS02)
To understand which actions to take, value functions are used. A value function represents the expected
return from a state (or state-action pair) under a policy π.
This value function shows the expected reward from state ss for policy π\pi, considering all future discounted
rewards. It quantifies the goodness of a state.
The optimal state-value function is:
This function helps evaluate which action to take. The optimal value v∗ can also be derived from Q:
Department of CSE,CEC 29
Machine Learning (BCS02)
Value Iteration
Value iteration estimates v∗ in terms of steps as:
The algorithm for value iteration is given as follows:
Algorithm 14.2: Value Iteration
• Step 1: At each iteration k+1, for all states s∈S.
• Step 2: Update v k+1 (s) from vk(s′) where s′ is the successor state of s.
• Step 3:
Policy Iteration
The second algorithm is policy iteration. The task of finding an optimal policy consists of two steps:
• Policy Evaluation
• Policy Improvement
Policy Evaluation
Initially, for a given policy, the algorithm starts with value vv function as zero. There is no reward. The
Bellman equation is used to obtain the value of vv. Then, the value of vv is obtained from the previous vv.
This process is continued till the optimal policy π∗\pi_* is obtained. In other words, at each iteration, all the
state values are updated using the previous iteration value.
Policy evaluation estimates v in terms of steps:
v1>v2>⋯>vn
and policy improvement is possible if the new policy π′>π. The improvement is done using a greedy
approach.
The algorithm for policy evaluation is given as follows:
Algorithm 14.3: Policy Evaluation
Policy Improvement
The policy improvement can be done as follows:
1. Evaluate the policy π\pi with policy evaluation.
2. Solve the Bellman equations for the given policy to get vπv^\pi.
Department of CSE,CEC 30
Machine Learning (BCS02)
3. Improve the policy by applying the greedy approach for policy vπv^\pi so that the policy becomes
π′.
4. Repeat the process till vπ converges to optimal policy π∗
Department of CSE,CEC 31
Machine Learning (BCS02)
total return over many episodes divided by the number of episodes gives the average, alternative way to
count every time a state is encountered and reward until termination.
Since the problem is non-stationary, the value function is computed for fixed policy. Then, the policy is
revised using dynamic programming.
The mean value is computed as:
Thus, the incremental Monte-Carlo update at the end of the episode using the approach - incremental
every-visit Monte-Carlo is given as:
Algorithm 14.5: MC
Step 1: For a policy π\, select state s.
Step 2: Generate an episode.
Step 3: Compute average returns for every first occurrence of state ss in an episode.
Step 4: Update the returns and start a new episode. Repeat for all episodes.
Here, α is constant parameter and rt is the reward after time ‘t’, and st is the state visited at time ‘t’.
Department of CSE,CEC 32
Machine Learning (BCS02)
TD Learning can be accelerated through eligibility traces as future rewards may be done for one to many
nodes. As soon as it receives an input, it uses a table so that the previous prediction is updated. Eligibility
traces provide an alternative short-term memory that all previous predictions are noted in the table. This
leads to a family of algorithms as TD(λ). Here, λ is called decay parameter whose value ranges between 0
and 1. If it is 0, then only previous prediction is updated. If it is 1, then all the predictions are updated.
14.9 Q-LEARNING
Q-Learning is an off-policy method because Q is updated based on policies that are implicit to Q and is
better guaranteed for maximum returns. Q indicates Quality. What is Q-value?
Q(s, a) is a numerical value assigned to a state-action pair. It means a value of the action that is performed
in state ‘s’.
Here, the main objective is to find Q-Value? Q-value is nothing but the immediate reward and other rewards
that are yet to come, known as total return reward. To compute the total return reward, these information are
necessary:
1. Starting state
2. Action
3. Reward
4. New state
Initially, a table called Q-table is constructed and filled with initial random values. Q-learning involves two
policies - learning policy and update policy. Q-learning is done by methods like greedy, softmax or softmax
plus.
The agent’s next move will be the action where rewards are high. Alternatively, it would be the cell that has
the highest Q-value. As the moves are determined by Q-values, the computation and updation of Q-values
are important for Q-learning. Then comes the Q-learning update policy.
The updation of Q-values is done using Temporal-Differencing method. This updation is made by blending
the new and old values. Blending is done by α. When α is zero, the value does not change at all. When α is
one, the new value replaces the old value. Here, α is known as the learning rate. The discount factor (γ) is
also used for update. Blending is done using temporal difference learning.
The updation procedure of Q-value is given as follows:
1. Perform any random action on state sₜ
2. Get a new state, sₜ₊₁
Department of CSE,CEC 33
Machine Learning (BCS02)
Here, r(sₜ, aₜ) is the reward obtained by performing action aₜ, and Q(sₜ₊₁, a) is the estimate of the best action
at state sₜ₊₁, i.e., s′ and Q(sₜ, aₜ) is the Q-value of the action aₜ at state sₜ. Here, the best action performed at
future states discounted by discount factor γ is:
If TD is high, then it is a ‘surprise factor’ and denotes the highest reward. If TD is less, it represents the
‘frustration’ factor and denotes the lesser reward. In other words, TD is a sort of ‘reward’.
It is initially high and slowly gets minimized as it reaches the end of the training, that is, when it reaches the
final goal.
The update is carried out using the Temporal difference learning (TD) and Bellman equation.
The Bellman equation that is used to update the value is given as follows:
Department of CSE,CEC 34
Machine Learning (BCS02)
The major difference between SARSA and Q-learning is that not necessarily the maximum reward is used
to update Q value. Instead, it is based on estimates. Thus, SARSA is an online method.
SARSA is based on estimated optimal action:
Department of CSE,CEC 35