0% found this document useful (0 votes)
4 views35 pages

MLnotes Module 5

This document provides an overview of clustering algorithms, a key component of unsupervised learning, which involves grouping unlabelled data into meaningful clusters based on similarity. It discusses various proximity measures used to assess similarity or dissimilarity among data points, including Euclidean, Manhattan, and Jaccard distances, as well as challenges faced in clustering such as high dimensionality and data scaling. Additionally, it introduces hierarchical clustering methods, including agglomerative and divisive approaches, and highlights the irreversible nature of cluster formation in these methods.

Uploaded by

sowndaryad59
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views35 pages

MLnotes Module 5

This document provides an overview of clustering algorithms, a key component of unsupervised learning, which involves grouping unlabelled data into meaningful clusters based on similarity. It discusses various proximity measures used to assess similarity or dissimilarity among data points, including Euclidean, Manhattan, and Jaccard distances, as well as challenges faced in clustering such as high dimensionality and data scaling. Additionally, it introduces hierarchical clustering methods, including agglomerative and divisive approaches, and highlights the irreversible nature of cluster formation in these methods.

Uploaded by

sowndaryad59
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Machine Learning (BCS02)

MODULE 5
Clustering Algorithms

13.1 INTRODUCTION TO CLUSTERING APPROACHES

Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning involves exploring
the given dataset. Cluster analysis is a technique of partitioning a collection of unlabelled objects that have
many attributes into meaningful disjoint groups or clusters. This is done using a trial-and-error approach as
there are no supervisors available as in classification. The characteristic of clustering is that the objects in
the clusters or groups are similar to each other within the clusters while differ from the objects in other
clusters significantly.

The input for cluster analysis is examples or samples. These are known as objects, data points or data
instances. All these terms are same and used interchangeably in this chapter. All the samples or objects with
no labels associated with them are called unlabelled. The output is the set of clusters (or groups) of similar
data if it exists in the input. For example, the following Figure 13.1(a) shows data points or samples with
two features shown in different shaded samples and Figure 13.1(b) shows the manually drawn ellipse to
indicate the clusters formed.

Visual identification of clusters in this case is easy as the examples have only two features. But, when
examples have more features, say 100, then clustering cannot be done manually and automatic clustering
algorithms are required. Also, automating the clustering process is desirable as these tasks are considered
difficult by humans and almost impossible. All clusters are represented by centroids. For example, if the
input examples or data is (3, 3), (2, 6) and (7, 9), then the centroid is given as

The clusters should not overlap and every cluster should represent only one class. Therefore, clustering
algorithms use trial and error method to form clusters that can be converted to labels. Thus, the important
differences between classification and clustering are given in Table 13.1.

Department of CSE,CEC 1
Machine Learning (BCS02)

Table 13.1: Differences between Classification and Clustering

[Link]. Clustering Classification


1 Unsupervised learning and cluster formation are Supervised learning with the presence of a
done by trial and error as there is no supervisor supervisor to provide training and testing data
2 Unlabelled data Labelled data
3 No prior knowledge in clustering Knowledge of the domain is must to label the
samples of the dataset
4 Cluster results are dynamic Once a label is assigned, it does not change

Applications of Clustering

1. Grouping based on customer buying patterns


2. Profiling of customers based on lifestyle
3. In information retrieval applications (like retrieval of a document from a collection of documents)
4. Identifying the groups of genes that influence a disease
5. Identification of organs that are similar in physiology functions
6. Taxonomy of animals, plants in Biology
7. Clustering based on purchasing behaviour and demography
8. Document indexing
9. Data compression by grouping similar objects and finding duplicate objects

Challenges of Clustering Algorithms

A huge collection of data with higher dimensions (i.e., features or attributes) can pose a problem for
clustering algorithms. With the arrival of the internet, billions of data are available for clustering algorithms.
This is a difficult task, as scaling is always an issue with clustering algorithms. Scaling is an issue where
some algorithms work with lower dimension data but do not perform well for higher dimension data. Also,
units of data can post a problem, like some weights in kg and some in pounds can pose a problem in
clustering. Designing a proximity measure is also a big challenge.

Table 13.2: Advantages and Disadvantages of Clustering Algorithms

[Link]. Advantages Disadvantages


1 Cluster analysis algorithms can handle missing data and Cluster analysis algorithms are
outliers sensitive to initialization and order
of the input data.
2 Can help classifiers in labelling the unlabelled data. Semi- Often, the number of clusters
supervised algorithms use cluster analysis algorithms to present in the data have to be
label the unlabelled data and then use classifiers to classify specified by the user.
them.
3 It is easy to explain the cluster analysis algorithms and to Scaling is a problem.
implement them.
4 Clustering is the oldest technique in statistics and it is easy Designing a proximity measure
to explain. It is also relatively easy to implement. for the given data is an issue.

Department of CSE,CEC 2
Machine Learning (BCS02)

13.2 PROXIMITY MEASURES

Clustering algorithms need a measure to find the similarity or dissimilarity among the objects to group them.
Similarity and dissimilarity are collectively known as proximity measures. Often, the distance measures are
used to find similarity between two objects, say i and j.

Distance measures are known as dissimilarity measures, as these indicate how one object is different from
another. Measures like cosine similarity indicate the similarity among objects. Distance measures and
similarity measures are two sides of the same coin, as more distance indicates more similarity and vice versa.
Distance between two objects, say i and j, is denoted by the symbol Dij.

The properties of the distance measures are:

1. Dij is always positive or zero.


2. Dii=0 i.e., the distance between the object to itself is 0.
3. Dij=Dji – This property is called symmetry.
4. Dij≤Dik+Dkj– This property is called triangular inequality.

If all these conditions are satisfied, then the distance measure is called a metric.

Based on the data types of the attributes of an object, distance measures vary. The concept of data types is
discussed in Module 2. It can be recollected that the data types are divided into categorical and quantitative
variables. Quantitative variables are numbers and are of two types – nominal and ordinal. For example,
gender is a nominal variable as gender can be enumerated as Gender = {male, female}. Ordinal variables
look like nominal variables but have an inherent order present in the enumeration. For example, temperature
is a nominal variable as temperature can be enumerated as Temperature = {low, medium, high}. One can
observe the inherent order present, that is, medium > low and low < medium. Quantitative variables are
real or Integer numbers or binary data. In binary data, the attributes of the object can take a Boolean value.
Objects whose attributes take binary data are called binary objects.

Quantitative Variables
Some of the qualitative variables are discussed below.

1. Euclidean Distance- It is one of the most important and common distance measures. It is also
called as L₂ norm. It can be defined as the square root of squared differences between the
coordinates of a pair of objects.
The Euclidean distance between objects xi and xj with k features is given as follows:

The advantage of Euclidean distance is that the distance does not change with the addition of new
objects. But the disadvantage is that if the units change, the resulting Euclidean or squared Euclidean
changes drastically. Another disadvantage is that as the Euclidean distance involves a square root and
a square, the computational complexity is high for implementing the distance for millions or billions
of operations involved.

Department of CSE,CEC 3
Machine Learning (BCS02)

2. City Block Distance- City block distance is known as Manhattan distance. This is also known as
boxcar, absolute value distance, Manhattan distance, Taxicab or L₁ norm. The formula for finding
the distance is given as follows:

3. Chebyshev Distance-Chebyshev distance is known as maximum value distance. This is the


absolute magnitude of the differences between the coordinates of a pair of objects. This distance is
called supremum distance or Lmax or L∞ norm. The formula for computing Chebyshev distance is
given as follows:

𝐷𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑥𝑖, 𝑥𝑗) = 𝑘𝑚𝑎𝑥 ∣ 𝑥𝑖𝑘 − 𝑥𝑗𝑘 ∣ (13.3)

Example 13.1:

Suppose, if the coordinates of the objects are (0, 3) and (5, 8), then what is the Chebyshev distance?

Solution:
The Euclidean distance using Eq. (13.1) is given as follows:

The Manhattan distance using Eq. (13.2) is given as follows:

The Chebyshev distance using Eq. (13.3) is given as follows:

4. Minkowski Distance-In general, all the above distance measures can be generalized as:

This is called Minkowski distance. Here, r is a parameter. When the value of r is 1, the distance
measure is called city block distance. When the value of r is 2, the distance measure is called
Euclidean distance. When r is ∞, then this is Chebyshev distance.

Department of CSE,CEC 4
Machine Learning (BCS02)

Binary Attributes

Binary attributes have only two values. Distance measures discussed above cannot be applied to find distance
between objects that have binary attributes. For finding the distance among objects with binary objects, the
contingency Table 13.3 can be used. Let x and y be the objects consisting of N-binary objects. Then, the
contingency table can be constructed by counting the number of matching of transitions, 0-0, 0-1, 1-0 and 1-
1.

Table 13.3: Contingency Table


Attributes Matching 0 1
0 a b
1 c d
In other words, ‘d’ is the number of attributes where x attribute is 1 and y attribute is 1.
‘b’ is the number of attributes where x attribute is 0 and y attribute is 1, ‘c’ is the number of attributes where
x attribute is 1 and y attribute is 0 and ‘a’ is the number of attributes where x attribute is 0 and y attribute is
0.

1. Simple Matching Coefficient (SMC)-SMC is a simple distance measure and is defined as the ratio
of number of matching attributes and the number of attributes. The formula is given as:

2. Jaccard Coefficient- Jaccard coefficient is another useful measure and is given as follows:

Example 13.2:
If the given vectors are x=(1,0,0) and y=(1,1,1), then find the SMC and Jaccard coefficient?
Given vectors:
• x= (1,0,0)
• y= (1,1,1)
We need to find the values of a, b, c, and d using the Contingency Table (13.3) for binary attributes.
The counts mean:
• a: both x=0 and y=0y
• b: x=0 , y = 1
• c: x=1, y = 0
• d: both x=1 and y=1

Step-by-step Matching
Let's compare each position of x and y :
1. First position:
x=1, y = 1 → d increases by 1
2. Second position:
x = 0, y=1→ b increases by 1

Department of CSE,CEC 5
Machine Learning (BCS02)

3. Third position:
x = 0, y = 1 → b increases by 1 again

Final Counts
• a=0(no pair where both are 0)
• b=2(positions 2 and 3)
• c=0(no pair where x=1 and y=0)
• d=1(position 1)

The SMC using Eq. (13.5) is given as:

Jaccard coefficient using Eq. (13.6) is given as:

3. Hamming Distance- Hamming distance is another useful measure that can be used for knowing the
sequence of characters or binary values. It indicates the number of positions at which the characters
or binary bits are different.
For example, the Hamming distance between x=(1 0 1) and y=(1 1 0) is 2 as x and y differ in two
positions. The distance between two words, say wood and hood is 1, as they differ in only one
character. Sometimes, more complex distance measures like edit distance can also be used.

Categorical Variables

In many cases, categorical values are used. It is just a code or symbol to represent the values. For example,
for the attribute Gender, a code 1 can be given to female and 0 can be given to male.

To calculate the distance between two objects represented by variables, we need to find only whether they
are equal or not. This is given as:

Ordinal Variables

Ordinal variables are like categorical values but with an inherent order. For example, designation is an ordinal
variable. If job designation is 1 or 2 or 3, it means code 1 is higher than 2 and code 2 is higher than 3. It is
ranked as 1 > 2 > 3.

Department of CSE,CEC 6
Machine Learning (BCS02)

Let us assume the designations of office employees are clerk, supervisor, manager and general manager.
These can be designated as numbers as clerk = 1, supervisor = 2, manager = 3 and general manager = 4.
Then, the distance between employee X who is a clerk and Y who is a manager can be obtained as:

Here, position (X) and position (Y) indicate the designated numerical value. Thus, the distance between X
(Clerk = 1) and Y (Manager = 3) using Eq. (13.8) is given as:

Vector Type Distance Measures

For text classification, vectors are normally used. Cosine similarity is a metric used to measure how similar
the documents are irrespective of their size. Cosine similarity measures the cosine of the angle between two
vectors projected in a multi-dimensional space. The similarity function for vector objects can be defined as:

The numerator is the dot product of the vectors A and B. The denominator is the product of the norm of
vectors A and B.

Example 13.3:

If the given vectors are A={1,1,0} and B={0,1,1}, then what is the cosine similarity?

Solution:
The dot product of the vector is 1×0+1×1+0×1= 1.
The norm of the vectors A and B is

So, the cosine similarity using Eq. (13.9) is given as:

Department of CSE,CEC 7
Machine Learning (BCS02)

13.3 HIERARCHICAL CLUSTERING ALGORITHMS

Hierarchical methods produce a nested partition of objects with hierarchical relationships among objects.
Often, the hierarchy relationship is shown in the form of a dendrogram.
Hierarchical methods include categories, agglomerative methods and divisive methods.
• In agglomerative methods, initially all individual samples are considered as a cluster, that is, a
cluster with a single element. Then, they are merged and the process is continued to get a single
cluster.
• Divisive methods use another kind of philosophy, where a single cluster of all samples of the dataset
taken initially is chosen and then partitioned. This partition process is continued until the cluster is
split into smaller clusters.
Agglomerative methods merge clusters to reduce the number of clusters. This is repeated each time while
merging two closest clusters to get a single cluster. The procedure of agglomerative clustering is given as
follows:

Algorithm 13.1: Agglomerative Clustering

1. Place each N sample or data instance into a separate cluster. So, initially N clusters are available.
2. Repeat the following steps until a single cluster is formed:
(a) Determine two most similar clusters.
(b) Merge the two clusters into a single cluster reducing the number of clusters as N−1.
3. Choose resultant clusters of step 2 as result.

All the clusters that are produced by hierarchical algorithms have equal diameters. The main disadvantage
of this approach is that once the cluster is formed, it is an irreversible decision.

13.3.1 Single Linkage or MIN Algorithm

Hierarchical clustering algorithms produce nested clusters, which can be visualized as a hierarchical tree or
dendrogram. The idea behind this approach is proximity among clusters. In single linkage algorithm, the
smallest distance (x,y), where x is from one cluster and y is from another cluster, is the distance between all
possible pairs of the two groups or clusters (or simply the smallest distance of two points where points are
in different clusters) and is used for merging the clusters. This corresponds to finding of minimum spanning
tree (MST) of a graph.
The distance measures between individual samples or data points is already demonstrated in the previous
section 13.2. To understand the single linkage algorithm, go through the following numerical problem that
involves finding the distance between clusters.

Example 13.4

Consider the array of points as shown in the following Table 13.4.


Table 13.4: Sample Data
Objects X Y
0 1 4
1 2 8
2 5 10
3 12 18
4 14 28
Apply single linkage algorithm.

Department of CSE,CEC 8
Machine Learning (BCS02)

Solution:
Step 1: Compute Euclidean distances (Proximity Matrix)
The similarity table among the variables is computed. Euclidean distance is calculated and displayed in the
following matrix:
Step-by-Step: Calculating the Proximity Matrix

1. Euclidean Distance Formula

• p=(x1,y1)
• q=(x2,y2)

2. Distance Calculations

Department of CSE,CEC 9
Machine Learning (BCS02)

4. Construct the Symmetric Matrix

Objects 0 1 2 3 4
0 – 4.123 7.211 17.804 27.295
1 4.123 – 3.606 14.142 23.324
2 7.211 3.606 – 10.630 20.124
3 17.804 14.142 10.630 – 10.198
4 27.295 23.324 20.124 10.198 –

The minimum distance is 3.606 between points 1 and 2, so we cluster {1,2} first.

Step 2: Update distance matrix after clustering {1, 2}

Table 13.6: After Iteration 1


Clusters {1,2} 0 3 4
{1,2} 4.123 10.630 20.124
0 - 17.804 27.295
3 – – 10.198
4 – – –

Using single linkage formula:

Department of CSE,CEC 10
Machine Learning (BCS02)

Step 3: Update distance matrix after clustering {0,1,2}


Table 13.7: After Iteration 2
Clusters {0,1,2} 3 4
{0,1,2} 10.630 20.124
3 – 10.198
4 – –

Step 4: Final clustering step


Table 13.8: After Iteration 3
Clusters {0,1,2} {3,4}
{0,1,2} ?
{3,4} –
• The computation of ? is not needed since only two clusters remain.
• Final merge: {0,1,2} and {3,4}

Dendrogram: it is used to plot hierarchical clusters


• A dendrogram for this clustering is shown in Figure 13.2.
• It visualizes the hierarchy of clustering based on the distances:

Department of CSE,CEC 11
Machine Learning (BCS02)

13.3.2 Complete Linkage or MAX or Clique


In complete linkage algorithm, the distance (x,y) , (where x is from one cluster and y is from another cluster),
is the largest distance between all possible pairs of the two groups or clusters (or simply the largest distance
of two points where points are in different clusters) as given below. It is used for merging the clusters.

Dendrogram for the above clustering is shown in Figure 13.3 for table 13.4.

13.3.3 Average Linkage


In case of an average linkage algorithm, the average distance of all pairs of points across the clusters is
used to form clusters. The average value computed between clusters Ci, Cj is given as follows:

Here, mi and mj are sizes of the clusters.


The dendrogram for Table 13.4 is given in Figure 13.4.

13.3.4 Mean-Shift Clustering Algorithm


Mean-shift is a non-parametric and hierarchical clustering algorithm. This algorithm is also known as mode
seeking algorithm or a sliding window algorithm. It has many applications in image processing and computer
vision.
There is no need for any prior knowledge of clusters or shape of the clusters present in the dataset. The
algorithm slowly moves from its initial position towards the dense regions.
The algorithm uses a window, which is basically a weighting function. Gaussian window is a good example
of a window. The radius of the kernel is called bandwidth. The entire window is called a kernel. The window
is based on the concept of kernel density function and its aim is to find the underlying data distribution. The
method of calculation of mean is dependent on the choice of windows. If a Gaussian window is chosen, then

Department of CSE,CEC 12
Machine Learning (BCS02)

every point is assigned a weight that decreases as the distance from the kernel center increases. The algorithm
is given below.

Algorithm 13.2: Mean-Shift Clustering

Step 1: Design a window.


Step 2: Place the window on a set of data points.
Step 3: Compute the mean for all the points that come under the window.
Step 4: Move the center of the window to the mean computed in step 3. Thus, the window moves towards
the dense regions. The movement to the dense region is controlled by a mean shift vector. The mean shift
vector is given as:

Here, K is the number of points and Sk is the data points where the distance from data points xi and centroid
of the kernel x is within the radius of the sphere. Then, the centroid is updated as x=x+vs

Step 5: Repeat the steps 3–4 for convergence. Once convergence is achieved, no further points can be
accommodated.
Advantages
1. No model assumptions
2. Suitable for all non-convex shapes
3. Only one parameter of the window, that is, bandwidth is required
4. Robust to noise
5. No issues of local minima or premature termination
Disadvantages
1. Selecting the bandwidth is a challenging task. If it is larger, then many clusters are missed.
If it is small, then many points are missed and convergence occurs as the problem.
2. The number of clusters cannot be specified and user has no control over this parameter.

13.4 Partitional Clustering Algorithm

‘k-means’ algorithm is a straightforward iterative partitional algorithm. Here, k stands for the user specified
requested clusters as users are not aware of the clusters that are present in the dataset. The k-means algorithm
assumes that the clusters do not overlap. Therefore, a sample or data point can belong to only one cluster in
the end. Also, this algorithm can detect clusters of shapes like circular or spherical.
Initially, the algorithm needs to be initialized. The algorithm can select k data points randomly or use the
prior knowledge of the data. In most cases, in kk-means algorithm setup, prior knowledge is absent. The
composition of the cluster is based on the initial condition, therefore, initialization is an important task. The
sample or data points need to be normalized for better performance.

The core process of the k-mean algorithm is assigning a sample to a cluster, that is, assigning each sample
or data point to the kk cluster centres based on its distance and the centroid of the clusters. This distance
should be minimum. As a new sample is added, new computation of the mean of the points for that cluster
to which a sample is assigned is required. Therefore, this iterative process is continued until no change of
instances to clusters is noticed. This algorithm then terminates and the termination is guaranteed.

Department of CSE,CEC 13
Machine Learning (BCS02)

Algorithm 13.3: k-means Algorithm


Step 1: Determine the number of clusters before the algorithm is started. This is called k.
Step 2: Choose k instances randomly. These are initial cluster centres.
Step 3: Compute the mean of the initial clusters and assign the remaining sample to the closest cluster
based on Euclidean distance or any other distance measure between the instances and the centroid of the
clusters.
Step 4: Compute new centroid again considering the newly added samples.
Step 5: Perform the steps 3–4 till the algorithm becomes stable with no more changes in assignment of
instances and clusters.

k-means can also be viewed as greedy algorithm as it involves partitioning nn samples to kk clusters to
minimize Sum of Squared Error (SSE). SSE is a metric that is a measure of error that gives the sum of the
squared Euclidean distances of each data to its closest centroid. It is given as:

Here, ci is the centroid of the ith cluster, x is the sample or data point and dist is the Euclidean distance. The
aim of the k-means algorithm is to minimize SSE.

Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to initialization process as change of initial points leads to different clusters.
2. If the samples are large, then the algorithm takes a lot of time.

How to Choose the Value of k?


It is obvious that kk is the user specified value specifying the number of clusters that are present. Obviously,
there are no standard rules available to pick the value of k. Normally, the k-means algorithm is run with
multiple values of k and within group variance (sum of squares of samples with its centroid) and plotted as
a line graph. This plot is called Elbow curve. The optimal or best value of k can be determined from the
graph. The optimal value of k is identified by the flat or horizontal part of the Elbow curve.

Complexity
The complexity of k-means algorithm is dependent on the parameters like n, the number of samples, k, the
number of clusters, Θ(nkld). l is the number of iterations and d is the number of attributes. The complexity
of k-means algorithm is O(n2).

Example 13.5: Consider the following set of data given in Table 13.9. Cluster it using k-means algorithm
with the initial value of objects 2 and 5 with the coordinate values (4, 6) and (12, 4) as initial seeds.
Table 13.9: Sample Data
Objects X-coordinate Y-coordinate
1 2 4
2 4 6
3 6 8
4 10 4
5 12 4

Department of CSE,CEC 14
Machine Learning (BCS02)

Solution: As per the problem, choose the objects 2 and 5 with the coordinate values. Hereafter, the objects'
id is not important. The samples or data points (4, 6) and (12, 4) are started as two clusters as shown in Table
13.10.
Initially, centroid and data points are same as only one sample is involved.
Table 13.10: Initial Cluster Table
Cluster 1 Cluster 2
(4, 6) (12, 4)
Centroid 1 (4, 6) Centroid 2 (12, 4)

Iteration 1: Compare all the data points or samples with the centroid and assign to the nearest sample.

For the object 1 (2, 4), the Euclidean distance between it and the centroid is given as:
Dist (1, centroid 1) = √((2–4)² + (4–6)²) = √8 =2.9
Dist (1, centroid 2) = √((2–12)² + (4–4)²) = √100 = 10

Object 1 is closer to the centroid of cluster 1 and hence assign it to cluster 1. This is shown in Table
13.11.

Object 2 is taken as centroid point.

For the object 3 (6, 8), the Euclidean distance between it and the centroid points is given as:
Dist (3, centroid 1) = √((6–4)² + (8–6)²) = √8 =2.9
Dist (3, centroid 2) = √((6–12)² + (8–4)²) = √52 =7.21

Object 3 is closer to the centroid of cluster 1 and hence remains in the same cluster 1.

Proceed with the next point object 4(10, 4) and again compare it with the centroids in Table 13.10.

Dist (4, centroid 1) = √((10–4)² + (4–6)²) = √40


Dist (4, centroid 2) = √((10–12)² + (4–4)²) = √4 = 2

Object 4 is closer to the centroid of cluster 2 and hence assign it to the cluster 2.

Recompute the new centroids of cluster 1 and cluster 2. They are (4, 6) and (11, 4), respectively.
Table 13.11: Cluster Table After Iteration 1
Cluster 1 Cluster 2
(4, 6), (2, 4), (6, 8) (10, 4), (12, 4)
Centroid 1 (4, 6) Centroid 2 (11, 4)
The second iteration is started again with the Table 13.11.
Obviously, the point (4, 6) remains in cluster 1, as the distance of it with itself is 0. The remaining
objects can be checked.
Take the sample object 1 (2, 4) and compare with the centroid of the clusters in Table 13.12.

Dist (1, centroid 1) = √((2–4)² + (4–6)²) = √8 =2.9


Dist (1, centroid 2) = √((2–11)² + (4–4)²) = √81 = 9

Object 1 is closer to centroid of cluster 1 and hence remains in the same cluster.

Department of CSE,CEC 15
Machine Learning (BCS02)

Take the sample object 3 (6, 8) and compare with the centroid values of clusters 1 (4, 6) and cluster 2 (11,
4) of the Table 13.12.

Dist (3, centroid 1) = √((6–4)² + (8–6)²) = √8 =2.9


Dist (3, centroid 2) = √((6–11)² + (8–4)²) = √41=6.40

Object 3 is closer to centroid of cluster 1 and hence remains in the same cluster.

Take the sample object 4 (10, 4) and compare with the centroid values of clusters 1 (4, 6) and cluster 2
(11, 4) of the Table 13.12:

Dist (4, centroid 1) = √((10–4)² + (4–6)²) = √40 =6.32


Dist (3, centroid 2) = √((10–11)² + (4–4)²) = √1 = 1

Object 4 is closer to centroid of cluster 2 and hence remains in the same cluster.

O Take the sample object 5 (12, 4) :


Dist (5, centroid 1) = √((12–4)² + (4–6)²) = √68 =8.25
Dist (5, centroid 2) = √((12–11)² + (4–4)²) = √1 = 1

Object 5 is closer to centroid of cluster 2 and hence remains in the same cluster.

Table 13.12: Cluster Table After Iteration 2


Cluster 1 Cluster 2
(4, 6), (2, 4), (6, 8) (10, 4), (12, 4)
Centroid (4, 6) Centroid (11, 4)

There is no change in the cluster Table 13.12. It is exactly the same; therefore, the k-means algorithm
terminates with two clusters with data points as shown in the Table 13.12.

13.5 DENSITY-BASED METHODS

Density-based spatial clustering of applications with noise (DBSCAN) is one of the density-based
algorithms. The density of a region represents the region where many points above the specified threshold
are present. In a density-based approach, the clusters are regarded as dense regions of objects that are
separated by regions of low density such as noise. This is similar to a human’s intuitive way of observing
clusters.
The concept of density and connectivity is based on the local distance of neighbours. The functioning of this
algorithm is based on two parameters: the size of the neighbourhood (ε) and the minimum number of points
(m).
1. Core point – A point is called a core point if it has more than a specified number of points (m) within
its ε-neighbourhood.
2. Border point – A point is called a border point if it has fewer than m points but is a neighbour of a
core point.
3. Noise point – A point that is neither a core point nor a border point.
The main idea is that every data point or sample should have at least a minimum number of neighbors in a
neighborhood. The neighborhood of radius ε should have at least m points. The notion of density
connectedness determines the quality of the algorithm.
The following connectedness measures are used for this algorithm:

Department of CSE,CEC 16
Machine Learning (BCS02)

1. Direct density reachable – The point X is directly reachable from Y if:


o (a) X is in the ε-neighborhood of Y
o (b) Y is a core point
2. Densely reachable – The point X is densely reachable from Y if there is a set of core points that
leads from Y to X.
3. Density connected – X and Y are densely connected if Z is a core point and thus points X and Y are
densely reachable from Z.

Algorithm 13.4: DBSCAN


Step 1: Randomly select a point p. Compute the distance between p and all other points.
Step 2: Find all points from p with respect to its neighborhood and check whether it has a minimum
number of points m. If so, it is marked as a core point.
Step 3: If it is a core point, then a new cluster is formed, or the existing cluster is enlarged.
Step 4: If it is a border point, then the algorithm moves to the next point and marks it as visited.
Step 5: If it is a noise point, they are removed.
Step 6: Merge the clusters if it is mergeable, i.e., dist(cᵢ, cⱼ) < ε.
Step 7: Repeat the process from Steps 3 to 6 until all points are processed.
Advantages:
1. No need for specifying the number of clusters beforehand
2. The algorithm can detect clusters of any shape
3. Robust to noise
4. Few parameters are needed
The complexity of this algorithm is O(n log n).

13.6 GRID-BASED APPROACH


The grid-based approach is a space-based approach. It partitions space into cells. The given data is fitted on
the cells for cluster formation.
There are three important concepts that need to be mastered for understanding grid-based schemes. They are:
1. Subspace clustering
2. Concept of dense cells
3. Monotonicity property

Subspace Clustering
Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many attributes.
Some data like gene data may have millions of attributes. Every attribute is called a dimension. But all the
attributes are not needed, as in many applications one may not require all the attributes. For example, an
employee’s address may not be required for profiling his diseases. Age may be required in that case. So, one
can conclude that only a subset of features is required. For example, one may be interested in grouping gene
data with similar characteristics or organs that have similar functions.
Finding subspaces is difficult. For example, N dimensions may have 2^(N-1) subspaces. Exploring all the
subspaces is a difficult task. Here, only the CLIQUE algorithms are useful for exploring the subspaces.
CLIQUE (Clustering in Quest) is a grid-based method for finding clustering in subspaces. CLIQUE uses a
multiresolution grid data structure.

Concept of Dense Cells


CLIQUE partitions each dimension into several overlapping intervals and intervals it into cells. Then, the
algorithm determines whether the cell is dense or sparse. The cell is considered dense if it exceeds a threshold
value, say τ. Density is defined as the ratio of number of points and volume of the region. In one pass, the

Department of CSE,CEC 17
Machine Learning (BCS02)

algorithm finds the number of cells, number of points, etc., and then combines the dense cells. For that, the
algorithm uses the contiguous intervals and a set of dense cells.

Algorithm 13.5: Dense Cells


Step 1: Define a set of grid points and assign the given data points on the grid.
Step 2: Determine the dense and sparse cells. If the number of points in a cell exceeds the threshold value
τ, the cell is categorized as a dense cell. Sparse cells are removed from the list.
Step 3: Merge the dense cells if they are adjacent.
Step 4: Form a list of grid cells for every subspace as output.

Monotonicity Property
CLIQUE uses the anti-monotonicity property or apriori property of the famous apriori algorithm. It means
that all the subsets of a frequent item should be frequent. Similarly, if the subset is infrequent, then all its
supersets are infrequent as well. Based on the apriori property, one can conclude that a k-dimensional cell
has r points if and only if every (k - 1) dimensional projection of this cell has at least r points. So like
association rule mining that uses the apriori rule, the candidate dense cells are generated for higher
dimensions. The algorithm works in two stages as shown below.

Algorithm 13.6: CLIQUE


Stage 1:
Step 1: Identify the dense cells.
Step 2: Merge dense cells c₁ and c₂ if they share the same interval.
Step 3: Generate Apriori rule to generate (k + 1)ᵗʰ cell for higher dimension. Then, check whether the
number of points cross the threshold. This is repeated till there are no dense cells or a new generation of
dense cells.
Stage 2:
Step 1: Merging of dense cells into a cluster is carried out in each subspace using maximal regions to
cover dense cells. The maximal region is a hyperrectangle where all cells fall into.
Step 2: Maximal region tries to cover all dense cells to form clusters.
In stage two, CLIQUE starts from dimension 2 and starts merging. This process is continued till the n-
dimension.

Advantages of CLIQUE
1. Insensitive to input order of objects
2. No assumptions of underlying data distributions
3. Finds subspace of higher dimensions such that high-density clusters exist in those subspaces

Disadvantage
The disadvantage of CLIQUE is that tuning of grid parameters, such as grid size, and finding optimal
threshold for finding whether the cell is dense or not is a challenge.

Department of CSE,CEC 18
Machine Learning (BCS02)

Reinforcement Learning
14.1 Overview of Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that mimics human learning through
interaction with the environment using senses like sight and hearing. The brain processes inputs and takes
actions, either voluntarily or involuntarily.
In real-world scenarios, learning happens through interaction. For example, when a child burns their hand
on fire, they learn not to repeat the action. Encouragement and scolding act as positive and negative
reinforcements, respectively, helping the child to repeat good behaviours and avoid bad ones. These
concepts are illustrated in Figure Examples of Positive and Negative Rewards.

In machines, this interaction is simulated through actions and corresponding positive or negative rewards.
RL uses a mathematical framework and is classified into:
• Positive Reinforcement Learning: Encourages behaviour via rewards (e.g., +10 points).
• Negative Reinforcement Learning: Discourages bad behaviour using penalties (e.g., -10 points).
RL is a form of semi-supervised learning used in sequential decision-making tasks.

14.2 Scope of Reinforcement Learning


Reinforcement Learning is applied in scenarios where actions must be taken in uncertain environments to
reach a goal.

A Grid Game shows a robot navigating a grid from starting point E to goal G, avoiding danger and
obstacles, using directions like left, right, up, and down.

Department of CSE,CEC 19
Machine Learning (BCS02)

A Grid Game depicts another layout where:


• Grey tile = Danger
• Black tile = Block
• Diagonal lines = Goal

RL is useful in uncertain or partially observable environments unlike object detection, which suits supervised
learning.

Characteristics of Reinforcement Learning


1. Sequential Decision Making – Actions are performed step-by-step to reach a goal.
2. Delayed Feedback – Rewards may be received after several actions.
3. Action Interdependence – Each action affects future outcomes.
4. Time-Related Actions – Actions are time-stamped and ordered inherently.

Challenges of Reinforcement Learning


1. Reward Design – Difficult to define meaningful rewards.
2. Model Absence – No fixed model in many games.
3. Partial Observability – Complete environment information may not be available.
4. Time Consumption – Large state/action spaces take longer.
5. Complexity – Games like GO involve huge configurations.

Applications of Reinforcement Learning


• Industrial Automation
• Resource Allocation
• Traffic Light Optimization
• Recommendation Systems
• Advertisement Bidding
• Customized Applications
• Driverless Cars
• Deep Learning in Games (Chess, GO)
• Generative Applications (e.g., DeepMind)

Department of CSE,CEC 20
Machine Learning (BCS02)

14.3 Reinforcement Learning as Machine Learning


Unlike supervised learning (which has a labelled dataset and a supervisor), reinforcement learning has no
labelled data. Instead, data is generated through interactions between the agent and environment. For tasks
like Chess and GO, RL is more suitable because there is no labelled data for all possibilities.
RL focuses on trial and error where actions, states, and rewards are observed to determine patterns over time.
Decisions are sequential, meaning each step builds on the last.

Table 14.1: Differences between Reinforcement Learning and Supervised Learning


Reinforcement Learning Supervised Learning
No supervisor or labelled dataset initially Labelled data and a supervisor present
Decisions are sequential Independent decisions
Delayed feedback Instant feedback
Actions affect future input Input-dependent only
Goal-oriented Target class predefined
E.g., Chess, GO E.g., Classifiers

Table 14.2: Differences between Reinforcement and Unsupervised Learning


Reinforcement Learning Unsupervised Learning
Input-output mapping present Not present
Feedback from environment No feedback

14.4 COMPONENTS OF REINFORCEMENT LEARNING


Reinforcement Learning (RL) is a type of machine learning concerned with how agents should take actions
in an environment to maximize cumulative reward. The fundamental components of RL, as depicted in
Figure are:

• Agent
• Environment
• State
• Action
• Reward

Department of CSE,CEC 21
Machine Learning (BCS02)

Reinforcement Problems
Reinforcement learning problems are generally categorized into two types:
1. Learning Problems:
Here, the environment is unknown to the agent, and the agent must learn by interacting with it
through trial and error. The goal is to improve the decision-making policy over time based on the
outcomes of actions taken.
2. Planning Problems:
In these scenarios, the environment is known beforehand. The agent uses this knowledge to plan its
actions in advance, optimizing the policy without needing to learn from interaction.

Environment and Agent


The environment is the space or world in which the agent operates. It includes everything external to the
agent and provides the input (state), receives the output (action), and returns the reward. Initially, the
environment starts in a predefined state, often called the initial state.
For instance, in a car navigation system, the environment includes roads, maps, traffic signals, and any
obstacles. These collectively define the context in which the agent must act.
The agent is an autonomous system (like a robot, chatbot, or software program) that perceives the state of
the environment and takes actions based on a policy.

States and Actions


In RL, the input is called a state (a snapshot of the environment at a given time), and the output is an action
(what the agent decides to do in response).
In Figure , the states are A, B, C, D, E, F, G, H and I. These states can be cities of a shortest path problem or
any other state encountered by a robot navigation. Let A be the starting state and G the goal or target state to
be reached. The symbol ‘S’ is used to denote the general state and ‘sₜ’ the specific state. ‘sₜ’ is used to denote
a state at time ‘t’.

Other important nodes are:


1. GOAL node – This is also known as terminal node and absorbing state. It is the goal and the agent
aims to get as it is the highest discounted cumulative node.

Department of CSE,CEC 22
Machine Learning (BCS02)

2. Non-terminal node – All other nodes are called non-terminal nodes.


3. Start – The initial state.
Actions are transitions between states, say the path between A and B is caused by UP action. Then, the agent
collects the rewards. In the game shown in Figure above, the actions are UP and DOWN only. Symbol A is
used to denote the general actions – UP and DOWN. The symbol ‘a’ is used to specify the specific action
and ‘aₜ’ is used to denote the action at time ‘t’.

Episodes
‘Episodes’ is the number of steps necessary to reach the goal state from start state. For example, in Figure
above, the aim is to find optimal path between A and G. There are two types of episodes.
One is called episodic and another is called continuous. In episodic tasks, there is a well-defined starting
state and end state. End state is called as terminal state or goal state. An episode is one which consists of
states, actions, and rewards. For example, all the paths from A to G, such as...
For example, a successful path (episode) from state A to goal state G could be:

A→B→D→F→G
or
A → B → E → G.

Policies
A policy defines the agent’s behavior. It is a mapping from states to actions. Policies can be:
• Deterministic: The same action is always chosen for a given state.
Mathematically, it is denoted as:
a = π(s) (Equation 14.1)
• Stochastic: There is a probability distribution over actions for a given state.
Represented as:
π(a | s) = Pr[aₜ = a | sₜ = s] (Equation 14.2)
The goal of RL is to find an optimal policy that maximizes cumulative reward over time.

Rewards
Rewards are signals returned by the environment in response to the agent’s action. They guide the learning
process by indicating how good or bad an action was.

Immediate Reward:
This is given instantly after an action is taken. The total return from an episode is the sum of all immediate
rewards collected from start to end.
Expressed as:
Gₜ = rₜ₊₁ + rₜ₊₂ + rₜ₊₃ + ... + r_T (Equation 14.3)
Where Gₜ is the total return, and r_T is the reward at the terminal state.

Department of CSE,CEC 23
Machine Learning (BCS02)

Long-term Reward:
Instead of focusing only on immediate rewards, long-term rewards consider the sum of all future rewards.
This is particularly useful in continuous tasks.
To calculate long-term returns, we introduce a discount factor (γ), which gives less importance to future
rewards.
Mathematically:

Here:
• γ (gamma) is the discount factor (0 ≤ γ ≤ 1).
• A higher γ (close to 1) gives more weight to future rewards.
• A lower γ (close to 0) prioritizes immediate rewards only.
Common choices for γ are 0.8 or 0.9, striking a balance between immediate and future rewards.

Example 14.1
Suppose we receive a reward of +1 at the fifth step in a process. If the discount factor, denoted by γ, is 0.7,
we need to determine the discounted reward or utility. We assume that rewards from all prior steps (1st to
4th) are zero.
According to the discounting formula (Eq. 14.4), the reward at the fifth step is computed as:
(0.7)5×1=0.16807×1 = 0.16807
Thus, the discounted reward is 0.16807.

Broadly, reinforcement algorithms are categorized into two types: model-based and model-free methods,
depending on whether the model of the environment is known. Several real-world problems can be
approached using a Markov Decision Process (MDP), which we'll now explore.

14.5 MARKOV DECISION PROCESS


A Markov Decision Process (MDP) is an extension of a Markov Chain. A Markov chain is a stochastic
process that satisfies the Markov property, meaning that the next state depends only on the current state
and not on the sequence of events that preceded it. It is a sequence of random variables X0,X1,…,Xn
satisfying conditional probability.
An illustrative Markov chain with two states (representing universities A and B) is shown in Figure .

In this example, 80% of students from university A move to university B, and only 20% remain at A.
Similarly, 60% of students from university B switch to A, and 40% stay at B.

Department of CSE,CEC 24
Machine Learning (BCS02)

From this, the transition matrix P can be constructed, which contains the probabilities of moving between
states:

Each row of the matrix represents a probability vector, meaning the sum of the elements in each row is 1.
If the vector u represents the initial state of the chain, the state after n steps is given by:
un+1=un×P 14.5
This allows us to determine how the system evolves over time by multiplying the current state vector with
the transition matrix.

Example 14.2
Assume initially 60% of students are at university A and 40% at university B, i.e., u0. Using the same
transition matrix P from Figure above, we can predict how these proportions change after one year and two
years.
After one year:

So, after one year, university A holds 28% and B holds 72%.
After two years:

Thus, the proportions keep changing due to student transfers.

If rewards are introduced into a Markov chain, it becomes a Markov Decision Process (MDP). In this case,
each edge in the graph is associated with both a probability and a reward.
The components of an MDP are:
1. A set of states
2. A set of actions
3. A reward function R
4. A policy π\pi
5. A value V
An MDP models the environment as a graph, with states as nodes and transitions as edges. Each edge has a
static transition probability.
The Markov Assumption is crucial for reinforcement learning. It states that the probability of reaching a
new state st+1 and obtaining a reward rt depends only on the current state st and action at, not on any earlier
states or rewards:

An MDP performs the following:


• Observes the current state st
• Performs action at
• Gets reward r(st ,at)
• Enters new state st+1

Department of CSE,CEC 25
Machine Learning (BCS02)

The transition probability for an action a is:

And the expected reward is:

Once the MDP is modelled, the system is trained and tested by iterating actions and rewards to optimize
performance.

14.6 MULTI-ARM BANDIT PROBLEM AND REINFORCEMENT PROBLEM


TYPES
Reinforcement learning uses trial and error to determine the best actions to maximize reward. Two key
sub-problems in reinforcement learning are:
1. Prediction – This involves estimating total reward or value, often called policy evaluation. It
requires a function known as the state-value function, which can be computed using temporal
differencing (TD-Learning).
2. Control – This focuses on finding actions that yield maximum return. It is also known as policy
improvement.
An important problem in this context is the Multi-Arm Bandit (MAB) problem. Consider a slot machine
with five arms, each of which returns a random amount of money when pulled. The objective is to find the
best lever that gives the maximum average return.
This is illustrated in the Figure below.

The expected reward from pulling an arm a is:

And the best action is the one that maximizes the expected reward:

Example 14.3
Assume a slot machine is used five times and returns rewards 1, 2, 0, 7, and 9. The quality of the chosen
action is calculated as:

To avoid storing all rewards, a recursive formula is used to update the average reward:

Department of CSE,CEC 26
Machine Learning (BCS02)

Exploration vs Exploitation and Selection Policies


• Exploration involves trying all actions, even if some are suboptimal, to discover potentially better
rewards. It is considered an adventurous strategy.
• Exploitation uses existing knowledge to repeatedly choose the best-known option. While safer, it
may lead to suboptimal results.
A balance between exploration and exploitation helps design effective selection policies.

Greedy Method
When there is no prior knowledge, choosing the best current action is known as the Greedy Method, which
relies solely on exploitation. In a multi-arm bandit scenario, all levers are initially tried, and the one with the
highest reward is then exploited repeatedly.
For instance, if the fifth lever provides the highest reward, it is repeatedly used, even though another lever
might have offered better rewards. Thus, the greedy approach lacks exploration and may not always find the
truly optimal action.
Here is the descriptive content extracted from the provided pages, with equations and figure numbers
preserved:

ε – Greedy Method
The ε-greedy method combines exploration and exploitation strategies in reinforcement learning. The
parameter ε ranges from 0 to 1, where lower values Favor exploitation and higher values Favor exploration.
Algorithm 14.1: ε – Greedy
• Step 1: Generate a random number pp in the range [0, 1].
• Step 2: If p≥εp the action with the highest reward is selected (exploitation).
• Step 3: If p<εp < , an action is chosen randomly (exploration).
• Step 4: Update the value of the action and repeat for all actions.
With probability ε, a random action is selected. Otherwise, the action with the highest reward (greedy choice)
is taken. This creates a balance between exploration and exploitation. For instance, if ε=0.2, then 20% of the
time random actions are taken, and 80% of the time the best-known actions are selected.

Reinforcement Agent Types


Agents in reinforcement learning can be classified using the following methods:
1. Value-based approaches
2. Policy-based approaches
3. Hybrid methods
4. Model-based methods
5. Model-free methods
Value-based Approaches
The goal is to optimize the value function V(s) which represents the maximum expected future reward an
agent receives at each state, discounted by a factor γ. A higher value indicates a better choice of state by the
agent.
Policy-based Approach
This method seeks to determine an optimal policy π, which is a strategy that maps states to probabilities of
selecting various actions. The goal is to maximize the expected return by finding the best policy.

Department of CSE,CEC 27
Machine Learning (BCS02)

Actor-Critic Methods
These are hybrid approaches that combine value-based and policy-based methods.
Model-based Approach
In this approach, a model of the environment is constructed. One way to represent the environment is through
a Markov Decision Process (MDP). Once the model is available, decision-making becomes more systematic.
Model-free Methods
In scenarios where a model of the environment is unavailable, simulation methods like Temporal Difference
(TD) and Monte Carlo techniques are employed to estimate values and policies.

Reinforcement Algorithm Selection


Reinforcement algorithms are chosen based on the environment's characteristics and update types. Table
14.3 outlines algorithm choices:
Table 14.3: Algorithm Selection
Model Environment On/Off Algorithms Used
Present Static Off Dynamic Programming
Absent Dynamic On SARSA
Absent Dynamic - Temporal Learning, Monte Carlo Methods
Absent Dynamic Off Q-Learning

14.7 MODEL-BASED LEARNING (PASSIVE LEARNING)


Model-based learning assumes a known environment, where for any given state, the next state, action, and
associated probabilities are defined. The Markov Decision Process (MDP) and Dynamic Programming (DP)
are instrumental tools in passive learning scenarios.
The mathematical framework required for passive learning is provided by Markov Decision Process. These
model-based reinforcement problems can be solved using dynamic programming after the construction of
the model using MDP.

In reinforcement learning, the focus is to take action a to transit from the current state to the end state. The
aim is to get rewards for the actions. They can be positive or negative.
The objective is to maximize the rewards by choosing the optimal policy.
Max E(rt∣r,st) for all possible values of ‘s’ at time ‘t’.
There are multiple courses of actions for a given state. The agent can prefer any course of action. So, the
behaviour of the agent needs to be understood. The algorithm to do that is called a policy. A policy is the
distribution of all actions with the probability of choosing these actions. Different actions can have different
rewards. It is necessary to find the total reward. This is possible only if the reward is quantified. It is done
using value functions.

Value Notation
The aim in reinforcement learning is to choose actions that maximize reward. This involves selecting the
optimal policy π\pi, which maps actions to their probability distributions. The policy maximizes the
expected cumulative reward:
Max E(rt∣r,st)

Department of CSE,CEC 28
Machine Learning (BCS02)

To understand which actions to take, value functions are used. A value function represents the expected
return from a state (or state-action pair) under a policy π.

State Value Notation


Let vπ(s) be the value of a state under policy π. It is defined as:

This value function shows the expected reward from state ss for policy π\pi, considering all future discounted
rewards. It quantifies the goodness of a state.
The optimal state-value function is:

Using the Bellman equation, this can be recursively defined as:

Action Value Function


The action-value function Qπ(s,a) provides the expected return after taking action aa in state ss under policy
π. It is given by:

This function helps evaluate which action to take. The optimal value v∗ can also be derived from Q:

The best action is determined using:

The Q-function can also be expressed using value functions:

Algorithms for Solving Reinforcement Problems using Conventional Methods


Based on value functions, two primary conventional algorithms are used:
1. Value Iteration
2. Policy Iteration
These techniques use value and policy updates to determine optimal strategies in reinforcement learning
tasks.

Department of CSE,CEC 29
Machine Learning (BCS02)

Value Iteration
Value iteration estimates v∗ in terms of steps as:
The algorithm for value iteration is given as follows:
Algorithm 14.2: Value Iteration
• Step 1: At each iteration k+1, for all states s∈S.
• Step 2: Update v k+1 (s) from vk(s′) where s′ is the successor state of s.
• Step 3:

Slowly, this converges to v∗(s), and finally the policy is:

Policy Iteration
The second algorithm is policy iteration. The task of finding an optimal policy consists of two steps:
• Policy Evaluation
• Policy Improvement
Policy Evaluation
Initially, for a given policy, the algorithm starts with value vv function as zero. There is no reward. The
Bellman equation is used to obtain the value of vv. Then, the value of vv is obtained from the previous vv.
This process is continued till the optimal policy π∗\pi_* is obtained. In other words, at each iteration, all the
state values are updated using the previous iteration value.
Policy evaluation estimates v in terms of steps:
v1>v2>⋯>vn
and policy improvement is possible if the new policy π′>π. The improvement is done using a greedy
approach.
The algorithm for policy evaluation is given as follows:
Algorithm 14.3: Policy Evaluation

Policy Improvement
The policy improvement can be done as follows:
1. Evaluate the policy π\pi with policy evaluation.
2. Solve the Bellman equations for the given policy to get vπv^\pi.

Department of CSE,CEC 30
Machine Learning (BCS02)

3. Improve the policy by applying the greedy approach for policy vπv^\pi so that the policy becomes
π′.
4. Repeat the process till vπ converges to optimal policy π∗

The algorithm is given as follows:


Algorithm 14.4: Policy Improvement

14.8 MODEL FREE METHODS


In model-free methods, there is no complete knowledge of the environment. So, how to determine the
rewards then?
The simplest formulation is the method called episodic formulation. Recall from section 14.2 that an
episode is the path from start to the goal node. For example, the shortest path between state A to state G in
Figure 14.5 is called an episode.
The simplest way to determine rewards is to play the game. If the game ends in a win, then all actions that
are encountered in the episode are given a reward, say +1. If the game is lost, then all actions are given a
value of –1. The only problem is that the game may be lost due to the last bad move while all other actions
may be good.
The second method is the continuous formulation.
Here, the reward is obtained at each step rather than at the end. For example, a chess program is given a
value of +10 for winning, 0 for a draw, and –10 for a loss. But while playing, it is given a value of –0.1 for
each move. This encourages shorter games and quick wins.
In both formulations, value functions are used to estimate the expected rewards of states.
Here is the extracted text from the image for your class notes:

14.8.1 Monte-Carlo Methods


Monte Carlo (MC) methods do not assume any model. Therefore, complete knowledge of the environment
is not available. This is similar to human and animal experience where the knowledge is gained by
experience, and experience is gained from the interaction with the environment. MC uses the methods of
experience as well as simulation to gain knowledge of the environment.
The experience is divided into many episodes. An episode, as discussed earlier, is a sequence of states from
starting state to goal state.
MC makes the following assumptions:
1. All episodes should terminate. No matter from where we start, the episode should end.
2. MC methods use value-action functions. These values are computed only after the completion of
the episode. So, MC methods are incremental methods.
The MC method collects the reward at the end of the episode. Then, it is used to find maximum expected
future reward. This method does not use expected return. Instead, it uses the empirical mean as return. The

Department of CSE,CEC 31
Machine Learning (BCS02)

total return over many episodes divided by the number of episodes gives the average, alternative way to
count every time a state is encountered and reward until termination.
Since the problem is non-stationary, the value function is computed for fixed policy. Then, the policy is
revised using dynamic programming.
The mean value is computed as:

Thus, the incremental Monte-Carlo update at the end of the episode using the approach - incremental
every-visit Monte-Carlo is given as:

Here, α is called learning rate. The algorithm for MC methods is as follows:

Algorithm 14.5: MC
Step 1: For a policy π\, select state s.
Step 2: Generate an episode.
Step 3: Compute average returns for every first occurrence of state ss in an episode.
Step 4: Update the returns and start a new episode. Repeat for all episodes.

14.8.2 Temporal Difference Learning


Temporal Difference (TD) learning is an alternative of Monte Carlo method. It is model free. It learns from
environment based on experience and interaction as MC methods are based on the principle of bootstrapping.
TD is called bootstrapping method because the update is based on the existing estimate and future reward.
The difference is the updated new estimate. In TD method, in every stop an estimate of the final reward is
calculated at each state. Then, immediately the state-action is updated for every step.
Unlike Monte Carlo method that computes the rewards at the end of the episode, TD Learning computes the
rewards at every step. The difference is listed in the following Table 14.4.

Table 14.4: Differences Between Monte-Carlo Method and TD Learning Method


Monte-Carlo Method Temporal Difference Method
It is easy to understand and does not exploit Markov principle Exploits Markov principle
This requires termination and episode Does not require termination

The temporal differencing (TD) target is given as:

In TD(0), the update towards the estimated return is given as:

Here, α is constant parameter and rt is the reward after time ‘t’, and st is the state visited at time ‘t’.

Department of CSE,CEC 32
Machine Learning (BCS02)

TD Learning can be accelerated through eligibility traces as future rewards may be done for one to many
nodes. As soon as it receives an input, it uses a table so that the previous prediction is updated. Eligibility
traces provide an alternative short-term memory that all previous predictions are noted in the table. This
leads to a family of algorithms as TD(λ). Here, λ is called decay parameter whose value ranges between 0
and 1. If it is 0, then only previous prediction is updated. If it is 1, then all the predictions are updated.

Algorithm 14.6: TD-Learning


1. Initialize policy π.
2. Set v(s)=0 for all states s∈S .
3. Repeat for each episode.
4. Initialize s.
5. Repeat for each step of episode.
6. Choose action ‘a’ and execute it. Observe the new state and new reward.
7. Update value using Bellman Eq. (14.22).
8. Update as s′.
9. Until s is terminal state.

14.9 Q-LEARNING
Q-Learning is an off-policy method because Q is updated based on policies that are implicit to Q and is
better guaranteed for maximum returns. Q indicates Quality. What is Q-value?
Q(s, a) is a numerical value assigned to a state-action pair. It means a value of the action that is performed
in state ‘s’.
Here, the main objective is to find Q-Value? Q-value is nothing but the immediate reward and other rewards
that are yet to come, known as total return reward. To compute the total return reward, these information are
necessary:
1. Starting state
2. Action
3. Reward
4. New state
Initially, a table called Q-table is constructed and filled with initial random values. Q-learning involves two
policies - learning policy and update policy. Q-learning is done by methods like greedy, softmax or softmax
plus.
The agent’s next move will be the action where rewards are high. Alternatively, it would be the cell that has
the highest Q-value. As the moves are determined by Q-values, the computation and updation of Q-values
are important for Q-learning. Then comes the Q-learning update policy.
The updation of Q-values is done using Temporal-Differencing method. This updation is made by blending
the new and old values. Blending is done by α. When α is zero, the value does not change at all. When α is
one, the new value replaces the old value. Here, α is known as the learning rate. The discount factor (γ) is
also used for update. Blending is done using temporal difference learning.
The updation procedure of Q-value is given as follows:
1. Perform any random action on state sₜ
2. Get a new state, sₜ₊₁

Department of CSE,CEC 33
Machine Learning (BCS02)

3. Get the award r(sₜ, aₜ)


The temporal difference at time ‘t’ is done using the update:

Here, r(sₜ, aₜ) is the reward obtained by performing action aₜ, and Q(sₜ₊₁, a) is the estimate of the best action
at state sₜ₊₁, i.e., s′ and Q(sₜ, aₜ) is the Q-value of the action aₜ at state sₜ. Here, the best action performed at
future states discounted by discount factor γ is:

If TD is high, then it is a ‘surprise factor’ and denotes the highest reward. If TD is less, it represents the
‘frustration’ factor and denotes the lesser reward. In other words, TD is a sort of ‘reward’.
It is initially high and slowly gets minimized as it reaches the end of the training, that is, when it reaches the
final goal.
The update is carried out using the Temporal difference learning (TD) and Bellman equation.
The Bellman equation that is used to update the value is given as follows:

Algorithm 14.7: Q-Learning


1. Set Q(sf,a)=0 where sf is the terminal node in an episode.
2. For all states s and action a, set Q Value = 0.
3. Repeat.
4. Select sₜ randomly.
5. Choose action aₜ from Q using ε-greedy.
6. Perform action aₜ such that r(st+1,at)>0r(s_{t+1}, a_t) > 0 and reach the next state.
7. Update the action-value function using TD using Eq. (14.24) and Bellman equation:

Until the terminal state s is reached.


Here, α is the learning rate. It is between 0 and 1. If α is zero, then nothing is updated and there is no learning
at all. When setting it to a higher value such as 0.9, the learning happens fast. γ is the discount factor. It is in
the range 0–1. It should be less so that the algorithm can converge.
The inference is simple, the best action is the maximum of Q function as follows:

14.10 SARSA Learning


SARSA is based on episodic formulation and discounted factor. Its name is derived from the function Q(s,
a, r, s′, a′), that is, ‘r’ is the reward and s′(sₜ₊₁) and a′(aₜ₊₁) are the new states and actions, respectively. It
selects a state ‘s’ and actions ‘a’ are chosen based on the ε-greedy approach.
SARSA is an on-policy method because each action aₜ is based on Q and Q is updated based on the actions
taken. SARSA converges to optimal policy and with the action-value function provided all state-action pairs
are visited infinitely often.

Department of CSE,CEC 34
Machine Learning (BCS02)

The major difference between SARSA and Q-learning is that not necessarily the maximum reward is used
to update Q value. Instead, it is based on estimates. Thus, SARSA is an online method.
SARSA is based on estimated optimal action:

The ideal rule for updating Q:

This can be given in terms of reward as:

Where, α and γ are learning rate and discount factor, respectively.

Algorithm 14.8: SARSA


1. For all states and actions, set Q(st,a)=0
2. Initialize α, ε and γ.
Initialize the state sₜ.
Choose action aₜ based on selection policy ε-greedy strategy and execute it.
Get a new state sₜ₊₁, observe the reward rₜ₊₁.
Choose action aₜ₊₁ from Q based on ε-Greedy procedure and execute it.
Update Action-value function of Eq. (14.28).
3. Update the values st=st+1,at=at+1
4. If s′ is terminal node, then start a new episode else repeat above steps.
The parameters of the SARSA algorithm are same as the Q-learning algorithm, that is, learning rate and
discount factor.

Department of CSE,CEC 35

You might also like