UNIT V
Principal Component Analysis (PCA)
PCA (Principal Component Analysis) is a dimensionality reduction technique and helps us to
reduce the number of features in a dataset while keeping the most important information. It
changes complex datasets by transforming correlated features into a smaller set of uncorrelated
components.
It helps us to remove redundancy, improve computational efficiency and make data easier to
visualize and analyze.
How Principal Component Analysis Works
PCA uses linear algebra to transform data into new features called principal components. It
finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the
covariance matrix. PCA selects the top components with the highest eigenvalues and projects
the data onto them simplify the dataset.
Imagine you’re looking at a messy cloud of data points like stars in the sky and want to simplify
it. PCA helps you find the "most important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:
Step 1: Standardize the Data
Different features may have different units and scales like salary vs. age. To compare them
fairly PCA first standardizes the data by making each feature have:
• A mean of 0
• A standard deviation of 1
𝑋−𝜇
𝑍=
𝜎
where:
• 𝜇 is the mean of independent features 𝜇 = {𝜇1 , 𝜇2 , ⋯ , 𝜇𝑚 }
• 𝜎 is the standard deviation of independent features 𝜎 = {𝜎1 , 𝜎2 , ⋯ , 𝜎𝑚 }
Step 2: Calculate Covariance Matrix
Next PCA calculates the covariance matrix to see how features relate to each other whether
they increase or decrease together. The covariance between two features 𝑥1 and 𝑥2 is:
𝑛
ˉ )(𝑥2𝑖 − 𝑥2
∑𝑖=1(𝑥 1𝑖 − 𝑥1 ˉ )
𝑐𝑜𝑣(𝑥1, 𝑥2) =
𝑛−1
Where:
• 𝑥ˉ1 𝑎𝑛𝑑𝑥ˉ2 are the mean values of features 𝑥1 𝑎𝑛𝑑𝑥2
• 𝑛 is the number of data points
The value of covariance can be positive, negative or zeros.
Step 3: Find the Principal Components
PCA identifies new axes where the data spreads out the most:
• 1st Principal Component (PC1): The direction of maximum variance (most spread).
• 2nd Principal Component (PC2): The next best direction, perpendicular to PC1 and
so on.
These directions come from the eigenvectors of the covariance matrix and their importance is
measured by eigenvalues. For a square matrix A an eigenvector X (a non-zero vector) and its
corresponding eigenvalue λ satisfy:
𝐴𝑋 = 𝜆𝑋
This means:
• When A acts on X it only stretches or shrinks X by the scalar λ.
• The direction of X remains unchanged hence eigenvectors define "stable directions"
of A.
Eigenvalues help rank these directions by importance.
Step 4: Pick the Top Directions & Transform Data
After calculating the eigenvalues and eigenvectors PCA ranks them by the amount of
information they capture. We then:
1. Select the top k components that capture most of the variance like 95%.
2. Transform the original dataset by projecting it onto these top components.
This means we reduce the number of features (dimensions) while keeping the important
patterns in the data.
Transform this 2D dataset into a 1D representation while preserving as much variance as
possible.
In the above image the original dataset has two features "Radius" and "Area" represented by
the black axes. PCA identifies two new directions: PC₁ and PC₂ which are the principal
components.
• These new axes are rotated versions of the original ones. PC₁ captures the maximum
variance in the data meaning it holds the most information while PC₂ captures the
remaining variance and is perpendicular to PC₁.
• The spread of data is much wider along PC₁ than along PC₂. This is why PC₁ is chosen
for dimensionality reduction. By projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the important structure and
patterns.
Advantages of Principal Component Analysis
1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues
when original features are highly correlated.
2. Noise Reduction: Eliminates components with low variance enhance data clarity.
3. Data Compression: Represents data with fewer components reduce storage needs and
speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.
Disadvantages of Principal Component Analysis
1. Interpretation Challenges: The new components are combinations of original
variables which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application or results
may be misleading.
3. Information Loss: Reducing dimensions may lose some important information if too
few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are linear
and may struggle with non-linear data.
5. Computational Complexity: Can be slow and resource-intensive on very large
datasets.
6. Risk of Overfitting: Using too many components or working with a small dataset
might lead to models that don't generalize well.
Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) is a technique used to break down large dataset
into smaller meaningful parts while ensuring that all values remain non-negative. This helps
in extracting useful features from data and making it easier to analyze and process it.
Matrix Decomposition and Representation in NMF
For a matrix A of dimensions 𝑚 × 𝑛 where each element is ≥ 0 NMF factorizes it into two
matrices 𝑊 and 𝐻 with dimensions 𝑚 × 𝑘 and 𝑘 × 𝑛 respectively where both matrices
contain only non-negative elements:
𝐴𝑚×𝑛 ≈ 𝑊𝑚×𝑘 𝐻𝑘×𝑛
where:
• 𝐴 : Original input matrix (a linear combination of W and H)
• 𝑊: Feature matrix (basis components)
• 𝐻: Coefficient matrix (weights associated with W)
• 𝑘: Rank (dimensionality of the reduced representation where 𝑘 ≤ min(𝑚, 𝑛)
NMF helps to identify hidden patterns in data by assuming that each data point can be
represented as a combination of fundamental features found in 𝑊.
Intuition Behind NMF
The goal of NMF is to simplify complex data into a smaller set of meaningful patterns. By
choosing a lower dimension k the decomposition highlights essential features while
ignoring noise.
• Each data point (column in 𝐴) is approximated as a combination of non-negative feature
vectors in 𝑊.
• This method assumes that data consists of meaningful parts that add up to form the
whole.
For example in facial recognition NMF can break down an image into basic facial features
such as eyes, nose and mouth. The 𝑊matrix contains these key features while the 𝐻matrix
defines how strongly each image is composed of these features.
Working of NMF
NMF decomposes a data matrix 𝐴 into two smaller matrices 𝑊and 𝐻using an iterative
optimization process that minimizes reconstruction error:
1. Initialization: Start with random non-negative values for 𝑊and 𝐻.
2. Iterative Update: Modify 𝑊and 𝐻to minimize the difference between 𝐴 and 𝑊 × 𝐻.
3. Stopping Criteria: The process stops when:
• The reconstruction error stabilizes.
• A set number of iterations is reached.
Common optimization techniques for NMF include:
• Multiplicative Update Rules: Ensures non-negativity by iteratively
adjusting 𝑊and 𝐻.
• Alternating Least Squares (ALS): Solves for 𝑊while keeping 𝐻fixed and vice versa,
in an alternating manner.
Real-life Example
Let us consider some real-life examples to understand the working of the NMF algorithm.
Let's take a case of image processing.
• Suppose we have an input image having pixels that form matrix A.
• Using NMF we factorize it into two matrices one containing the facial feature set
[Matrix W]
• Other contains the importance of each facial feature in the input image i.e. the weights
[Matrix H]. (As shown in below image)
NMF in Image Processing
Applications of NMF
NMF has a wide range of applications including:
• Image Processing: Feature extraction in facial recognition and object detection.
• Text Mining and NLP Task: Topic modeling by decomposing a document-term
matrix into key topics.
• Spectral Data Analysis: Identifying hidden patterns in sound, medical signals and
chemical spectra.
• Bioinformatics: Gene expression analysis for identifying molecular patterns in
biological data.
Manifold Learning
Manifold learning is a type of unsupervised learning technique that focuses on the nonlinear
structure of data. The goal is to reduce the dimensionality of the data while preserving the
relationships or structures within it. It is particularly useful when dealing with high-
dimensional data where linear methods like PCA might not be appropriate.
In manifold learning, the idea is to capture the underlying manifold, or surface, on which the
data resides. This underlying manifold is often of much lower dimensionality than the original
space, and capturing it can help in visualizing, understanding, and analyzing the data.
Common manifold learning techniques include:
1. t-SNE (t-Distributed Stochastic Neighbor Embedding): This technique is
particularly good at creating a two or three-dimensional map of the data, where similar
items are close together.
2. Isomap: This method works by estimating the geodesic or ‘curved’ distance between
points on the manifold.
3. Locally Linear Embedding (LLE): LLE seeks to preserve local relationships within
the data and can unfold twisted manifolds.
4. Spectral Embedding: This method uses the eigenvalues of a similarity matrix to
project the data into a lower-dimensional space.
Manifold learning can be applied to various domains such as image processing, speech
recognition, and bioinformatics. It’s particularly useful for visualizing complex data in a way
that highlights intrinsic relationships.
While manifold learning methods are powerful, they are not without their challenges. They can
be sensitive to noise, tuning parameters might be tricky, and they might be computationally
expensive for very large datasets.
Clustering
1. K-Means Clustering Algorithm
K-Means is a widely used clustering algorithm that partitions data points into K clusters
based on their similarity. The algorithm works by iteratively updating the cluster
centroids until convergence is achieved.
K-Means algorithm Steps:
1. Choose the number of clusters, K.
2. Randomly initialize K centroids.
3. Assign each data point to the nearest centroid.
4. Recalculate the centroid of each cluster.
5. Repeat steps 3 and 4 until convergence is achieved.
The distance metric used to determine the nearest centroid can be Euclidean,
Manhattan, or any other distance metric of choice.
Advantages of K-Means clustering algorithm :
▪ Easy to implement and interpret
▪ Fast and efficient for large datasets
▪ Works well with spherical clusters
Disadvantages of K-Means clustering algorithm :
▪ Assumes equal cluster sizes and variances
▪ Sensitive to initial centroid positions
▪ Can converge to local optima
Agglomerative Clustering in Machine Learning
Agglomerative clustering is a hierarchical clustering algorithm that starts with each data point
as its own cluster and iteratively merges the closest clusters until a stopping criterion is reached.
It is a bottom-up approach that produces a dendrogram, which is a tree-like diagram that shows
the hierarchical relationship between the clusters. The algorithm can be implemented using the
scikit-learn library in Python.
Agglomerative Clustering Algorithm
To group similar data points into clusters based on their proximity, Agglomerative Clustering
is used which is a type of hierarchical clustering. It follows a bottom-up approach, where each
data point starts as its own cluster and gradually merges with others based on similarity.
• The merging continues until all points form a single cluster or a set number of clusters
remain.
• It uses distance metrics like Euclidean or Manhattan distance to measure similarity.
• The process is often visualized using a dendrogram, which shows the hierarchy of
cluster formation.
• Common linkage methods include single, complete, average and ward linkage.
Animal Categorization Tree
Workflow
Lets dicuss step by step how it works:
1. Start with all points separate:
• Treat each data point as its own cluster like A, B, C, ...
• Initially, you have n clusters for n data points.
2. Compute pairwise distances:
• Calculate the distance between every pair of clusters.
• Common choices include Euclidean, Manhattan or Cosine distance.
• Store these values in a distance matrix.
3. Merge the nearest clusters:
• Identify the two clusters that are closest based on the chosen linkage method such as
single, complete, average or Ward linkage.
• Combine them into a single new cluster.
4. Update distances:
• Recalculate the distances between the newly formed cluster and all remaining clusters.
• Use the same linkage rule to ensure consistency.
5. Repeat the process:
• Continue merging clusters and updating distances iteratively.
• Stop when you reach a predefined number of clusters (k) or a distance threshold.
6. Visualize the results:
• Create a dendrogram to visualize how clusters merged at each step.
• Choose a suitable cut on the dendrogram to obtain the final cluster groups.
DBSCAN Clustering
DBSCAN is a density-based clustering algorithm that groups data points that are closely
packed together and marks outliers as noise based on their density in the feature space. It
identifies clusters as dense regions in the data space separated by areas of lower density. Unlike
K-Means or hierarchical clustering which assumes clusters are compact and spherical,
DBSCAN perform well in handling real-world data irregularities such as:
• Arbitrary-Shaped Clusters: Clusters can take any shape not just circular or convex.
• Noise and Outliers: It effectively identifies and handles noise points without assigning
them to any cluster.
Data Sets with clustering algorithms
The figure above shows a data set with clustering algorithms: K-Means and Hierarchical
handling compact, spherical clusters with varying noise tolerance while DBSCAN manages
arbitrary-shaped clusters and noise handling.
Key Parameters in DBSCAN
1. eps: This defines the radius of the neighborhood around a data point. If the distance between
two points is less than or equal to eps they are considered neighbors. A common method to
determine eps is by analyzing the k-distance graph. Choosing the right eps is important:
• If eps is too small most points will be classified as noise.
• If eps is too large clusters may merge and the algorithm may fail to distinguish between
them.
2. MinPts: This is the minimum number of points required within the eps radius to form a
dense region. A general rule of thumb is to set MinPts >= D+1 where D is the number of
dimensions in the dataset.
For most cases a minimum value of MinPts = 3 is recommended.
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. Core points which have a sufficient number of neighbors within a specified radius
(eplison)
2. Border points which are near core points but lack enough neighbors to be core points
themselves
3. Noise points which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting density-reachable points,
DBSCAN forms clusters without relying on rigid assumptions about their shape or size.
DBSCAN Algorithm
Steps in the DBSCAN Algorithm
1. Identify Core Points: For each point in the dataset count the number of points within
its eps neighborhood. If the count meets or exceeds MinPts mark the point as a core
point.
2. Form Clusters: For each core point that is not already assigned to a cluster create a
new cluster. Recursively find all density-connected points i.e points within
the eps radius of the core point and add them to the cluster.
3. Density Connectivity: Two points a and b are density-connected if there exists a chain
of points where each point is within the eps radius of the next and at least one point in
the chain is a core point. This chaining process ensures that all points in a cluster are
connected through a series of dense regions.
4. Label Noise Points: After processing all points any point that does not belong to a
cluster is labeled as noise.
Comparison of Clustering Algorithms
Evaluating Clustering Algorithms
Clustering metrics are used to evaluate all clustering algorithms. Since clustering is
unsupervised, accuracy isn’t straightforward like in classification. Instead, we rely on
evaluation [Link] us take a look at some of the most commonly used clustering metrics:
1. Silhouette Score
The Silhouette Score is a way to measure how good the clusters are in a dataset. It helps us
understand how well the data points have been grouped. The score ranges from -1 to 1.
• A score close to 1 means a point fits really well in its group (cluster) and is far from
other groups.
• A score close to 0 means the point is on the border between two clusters.
• A score close to -1 means the point might be in the wrong cluster.
Silhouette Score (S) for a data point i is calculated as:
𝑏(𝑖) − 𝑎(𝑖)
𝑆(𝑖) =
𝑚𝑎𝑥(𝑎(𝑖), 𝑏(𝑖))
where,
• a(i) is the average distance from i to other data points in the same cluster.
• b(i) is the smallest average distance from i to data points in a different cluster.
2. Davies-Bouldin Index
The Davies-Bouldin Index (DBI) helps us measure how good the clustering is in a dataset. It
looks at how tight each cluster is (compactness), and how far apart the clusters are (separation).
• Lower DBI = better, clearer clusters
• Higher DBI = messy, overlapping clusters
A lower score is better, because it means:
• Points in the same cluster are close to each other.
• Different clusters are far apart from one another.
Davies-Bouldin Index (DB) is calculated as:
1 𝑘 𝑅𝑖𝑖 + 𝑅𝑗𝑗
𝐷𝐵 = ∑ max𝑗≠𝑖 ( )
𝑘 𝑖=1 𝑅𝑖𝑗
where,
• k is the total number of clusters.
• 𝑅𝑖𝑖 is the compactness of cluster i.
• 𝑅𝑗𝑗 is the compactness of cluster j.
• 𝑅𝑖𝑗 is the dissimilarity (distance) between cluster i and cluster j.
3. Calinski-Harabasz Index (Variance Ratio Criterion)
The Calinski-Harabasz Index measures how good the clusters are in a dataset.
It looks at:
• How close the points are inside each cluster?
• How far apart the clusters are?
A higher score is better, as it means the clusters are tight and well-separated. It helps
determine the ideal number of clusters.
Calinski-Harabasz Index (CH) is calculated as:
𝐵 𝑁−𝐾
𝐶𝐻 = ×
𝑊 𝐾−1
where,
• B is the sum of squares between clusters.
• W is the sum of squares within clusters.
• N is the total number of data points.
• K is the number of clusters.
Calculating between group sum of squares (B)
𝐾
𝐵=∑ 𝑛𝑘 ×∣∣ 𝐶𝑘 –𝐶 ∣∣2
𝑘=1
where,
• 𝑛𝑘 is the number of observation in cluster 'k'
• 𝐶𝑘 is the centroid of cluster 'k'
• C is the centroid of the dataset
• K is number of clusters
Calculating within the group sum of squares (W)
𝑛𝑘
𝑊=∑ ∣∣ 𝑋𝑖𝑘 –𝐶𝑘 ∣∣2
𝑖=1
where,
• 𝑛𝑘 is the number of observation in cluster 'k'
• 𝑋𝑖𝑘 is the i-th observation of cluster 'k'
• 𝐶𝑘 is the centroid of cluster 'k'
4. Adjusted Rand Index (ARI)
The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by
comparing it to the true labels (ground truth).
It checks how well the pairs of points are grouped:
• Are the same pairs together in both the real and predicted clusters?
• Are different pairs also kept apart correctly?
The score ranges from -1 to 1:
• 1 means perfect match - the clustering is exactly right.
• 0 means random guess - no better than chance.
• Below 0 means worse than random - very poor clustering.
Adjusted Rand Index (ARI) is calculated as:
(𝑅𝐼−𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑅𝐼 )
𝐴𝑅𝐼 =
(𝑚𝑎𝑥(𝑅𝐼) − 𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑅𝐼 )
where,
• RI is the Rand Index.
• 𝐸𝑥𝑝𝑒𝑐𝑡𝑒𝑑𝑅𝐼 is the expected value of the Rand Index.
5. Mutual Information (MI)
Mutual Information measures how much two variables are related or connected. In clustering,
it compares how much the true cluster labels match with the predicted labels. It shows
how much knowing about one variable helps us predict the other. The more agreement there is,
the higher the score.
• Higher values mean better agreement between the clusters.
• Zero means no agreement at all.
MI between true labels Y and predicted labels Z is calculated as:
𝑝(𝑦𝑖 , 𝑧𝑗 )
𝑀𝐼(𝑦, 𝑧) = ∑ ∑ 𝑝( 𝑦𝑖 , 𝑧𝑗 ) ⋅ log ( )
𝑗 𝑝(𝑦𝑖 ) ⋅ 𝑝(𝑧𝑗 )
𝑖
where,
• 𝑦𝑖 is a true label.
• 𝑧𝑖 is a predicted label.
• 𝑝(𝑦𝑖 , 𝑧𝑖 ) is the joint probability of 𝑦𝑖 and 𝑧𝑗 .
• 𝑝(𝑦𝑖 ) and 𝑝(𝑧𝑖 ) are the marginal probabilities.
These clustering metrics help in evaluating the quality and performance of clustering
algorithms, allowing for informed decisions when selecting the most suitable clustering
solution for a given dataset.
Summary of Clustering Methods
Clustering in machine learning groups unlabeled data points into clusters based on similarity,
enabling pattern discovery without supervision. Common methods fall into categories like
partitioning, hierarchical, density-based, and model-based approaches.
Partitioning Methods
These divide data into a predefined number of clusters (k) using centroids. K-Means initializes
k centroids, assigns points to the nearest, and iteratively updates centroids until convergence;
it's efficient for spherical clusters but sensitive to initialization and outliers. K-Medoids (e.g.,
PAM) uses actual data points as centers (medoids) for robustness to noise.
Hierarchical Methods
These build a tree of clusters without specifying k upfront. Agglomerative (bottom-up) starts
with each point as a cluster and merges closest pairs; divisive (top-down) splits a single cluster
recursively. They produce dendrograms for visualization but scale poorly to large datasets.
Density-Based Methods
These identify clusters as dense regions separated by sparse areas, handling arbitrary shapes
and noise. DBSCAN defines core points by neighborhood density (eps, minPts parameters)
and expands clusters from them; OPTICS and HDBSCAN extend it for varying densities.
Model-Based Methods
These assume data follows a mixture of probability distributions, like Gaussian Mixture
Models (GMM), which use EM algorithm to fit parameters and soft-assign points via
probabilities. Useful for overlapping clusters but computationally intensive.
Method
Key Algorithms Strengths Weaknesses
Type
K-Means, K- Needs k predefined, assumes
Partitioning Fast, scalable
Medoids spherical clusters
No k needed,
Hierarchical Agglomerative O(n²) time
hierarchical view
Density- DBSCAN,
Handles noise/shapes Parameter-sensitive
Based OPTICS
Probabilistic
Model-Based GMM Assumes distributions
assignments