Unsupervised Learning
Unsupervised learning is a machine learning technique used to discover natural groupings or hidden
patterns in datasets that do not have predefined categories or labels.
• The Goal: To identify the underlying structure of unlabeled data.
• Mechanism: The model uses similarity or distance measures (like Euclidean distance) to
group data points with shared characteristics.
• Output: Each data point is assigned a cluster ID or represented in a lower-dimensional
space.
• Real-Life Example: A store has customer purchase data but doesn’t know which customer
segments exist. Unsupervised learning can group customers who buy similar products (e.g.,
"baking enthusiasts" vs. "snack buyers") without being told what those groups are.
Clustering
Clustering in Machine Learning is an unsupervised technique that groups similar data points
together without predefined labels, such that points in the same group are more similar to each
other than to points in other groups.
Types:
• Hard Clustering: Each data point is strictly assigned to exactly one cluster with clear
boundaries (e.g., a customer is either in "Cluster 1" or "Cluster 2").
• Soft (Fuzzy) Clustering: Assigns each point a probability or degree of membership to
multiple clusters simultaneously.
• Example: A medical diagnosis might show a patient has a 70% match for one
condition and 30% for another.
Clustering Methods:
• Centroid-based.
Data points are grouped around a central point called a centroid. Each cluster has one centroid.
Working.
1. Choose k (number of groups).
Decide how many clusters you want. Example: k = 5.
2. Initialize centroids.
Pick random points from the dataset as starting “pins” (centroids).
3. Assign points to nearest centroid.
Each data point checks which centroid is closest and joins that group. 👉 This forms the
first rough grouping.
4. Update centroids.
• For each group, calculate the average position of all points in that group
(for 2-D data, the average x and the average y of its points).
• That average becomes the new centroid.
5. Reassign points.
With updated centroids, each point checks again which centroid is now closest and joins that
centroid's group. Some points may switch groups.
6. Repeat until stable.
Keep looping assign → update → reassign. When centroids stop moving (or move very
little), the groups are finalized.
Types.
K-Means Clustering (Centroid-based)
K-Means groups the data into k clusters by minimizing the distance between points and their cluster
centers (centroids).
Working:
1. Choose K value (number of clusters).
2. Initialize K centroids randomly (for example, by picking K random data points).
3. Assign each point to the nearest centroid.
4. Update centroids (average of points in each cluster).
5. Repeat assign → update until centroids stabilize.
Key: Hard assignment (each point belongs to exactly one cluster).
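The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the fixed random seed, and the toy data are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: random data points as starting centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid (hard assignment).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its old centroid
        if new_centroids == centroids:  # step 5: stop once no centroid moves
            break
        centroids = new_centroids
    return centroids, labels

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, k=2)
print(labels)  # the first three points share one label, the last three share the other
```

Which cluster is numbered 0 depends on the random start, so only the grouping is meaningful, not the label values.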
Elbow Method:
A technique to find the best number of clusters in K-Means.
Working:
1. Run K-Means for different values of k (k = 2, then k = 3, and so on).
2. For each run, compute the squared distance from every point to its assigned centroid and sum
these squared distances. This sum is the WCSS (within-cluster sum of squares) for that k.
3. Plot WCSS vs k.
4. The “elbow point” (where the curve bends and further increases in k give only small
reductions in WCSS) indicates a good choice of k.
Key: Balances between too few clusters (high error) and too many clusters (overfitting).
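As a rough sketch of the elbow procedure, the code below runs a compact K-Means for several values of k and records the WCSS for each. The helper name kmeans_wcss, the seed, and the three-blob toy data are assumptions for illustration; plotting is left out, so the values are simply printed.

```python
import math
import random

def kmeans_wcss(points, k, iters=50, seed=0):
    """Run a basic K-Means and return its WCSS (sum of squared distances to the nearest centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (empty clusters stay put).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Toy data with three visible groups, so the bend is expected around k = 3.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
wcss_by_k = {k: kmeans_wcss(data, k) for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

Because K-Means starts from random centroids, a single run per k can land in a poor local optimum; in practice each k is usually run several times and the lowest WCSS kept.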
K-Means ++
Improved version of K-Means that chooses better initial centroids to avoid poor clustering.
Working:
1. Pick the first centroid randomly.
2. For each next centroid, pick a data point at random with probability proportional to its
squared distance from the nearest existing centroid, so far-away points are more likely to be chosen.
3. Run K-Means normally.
Key: Reduces chances of bad initialization → faster convergence and better clusters.
• Real-Life Example: A delivery company uses K-Means to find the central locations for 5
warehouses to minimize the total travel distance to all their customers.
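The K-Means++ seeding rule described above can be sketched as follows. The function name kmeans_pp_init and the toy data are illustrative assumptions; the sketch covers only the initialization, after which ordinary K-Means would run.

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k starting centroids using distance-squared weighting."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # step 1: first centroid chosen uniformly at random
    while len(centroids) < k:
        # Squared distance from each point to its nearest already-chosen centroid.
        d2 = [min(math.dist(p, c) for c in centroids) ** 2 for p in points]
        # Step 2: sample the next centroid with probability proportional to d2.
        # Points already chosen have weight 0, so they cannot be picked again.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
seeds = kmeans_pp_init(data, k=3)
print(seeds)
```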
K-Modes Clustering
Extension of K-Means for categorical data (non-numeric, like colors, names, categories).
Working:
1. Replace Euclidean distance with a matching dissimilarity measure: for each attribute, score 0
if the categories match and 1 if they do not, then sum the scores and divide by the number of columns.
2. Replace mean (centroid) with mode (most frequent category).
3. Assign points to nearest mode.
4. Update modes until stable.
Key: Works for categorical attributes (e.g., survey data, product categories). It forms clusters
from records whose categories match.
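The two building blocks of K-Modes, the matching dissimilarity and the per-column mode, can be sketched like this. The function names and the small clothing records are made up for illustration.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Fraction of attributes on which two categorical records differ (0 = identical)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def column_modes(records):
    """Most frequent category in each column; this plays the role of the centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

cluster = [
    ("red", "small", "cotton"),
    ("red", "large", "cotton"),
    ("blue", "small", "cotton"),
]
mode = column_modes(cluster)
print(mode)  # ('red', 'small', 'cotton')
print(matching_dissimilarity(("red", "large", "wool"), mode))  # 2 of 3 attributes differ
```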
Fuzzy C-Means (FCM) Clustering
Soft clustering method where each point can belong to multiple clusters with different membership
probabilities.
Working:
1. Choose c (number of clusters).
2. Initialize cluster centers.
3. Assign each point a membership value for each cluster (between 0 and 1).
4. Update cluster centers based on weighted average of points.
5. Repeat until membership values stabilize.
Key: Unlike K-Means (hard assignment), FCM allows soft assignment (e.g., a point can be 70% in
Cluster A, 30% in Cluster B).
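A minimal 1-D sketch of the FCM loop is shown below. It assumes the commonly used update rules with a fuzziness exponent m = 2 (the notes do not state a value for m), and the function names and data are illustrative only.

```python
def fcm_memberships(points, centers, m=2.0):
    """Membership of each 1-D point in each cluster; every row sums to 1."""
    power = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        zeros = [d == 0.0 for d in dists]
        if any(zeros):
            # The point sits exactly on a center: give it full membership there.
            rows.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
        else:
            rows.append([1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists])
    return rows

def fcm_centers(points, memberships, m=2.0):
    """Each center is the average of all points, weighted by membership ** m."""
    centers = []
    for j in range(len(memberships[0])):
        weights = [row[j] ** m for row in memberships]
        centers.append(sum(w * x for w, x in zip(weights, points)) / sum(weights))
    return centers

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 3.0]  # two groups plus one point halfway between
centers = [0.0, 6.0]                           # arbitrary starting centers
for _ in range(20):
    u = fcm_memberships(points, centers)
    centers = fcm_centers(points, u)

print([round(c, 2) for c in centers])
print([round(v, 2) for v in u[-1]])  # the halfway point is split roughly evenly
```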
• Distribution-based:
Clusters are formed by assuming the data is generated from a mixture of probability distributions
(often Gaussian/Normal). Each cluster is represented by a distribution, not just a centroid, so a
data point can belong, for example, 70% to cluster A and 30% to cluster B.
Working.
1. Assume k distributions. Decide how many clusters (distributions) you want. Example: k=3.
2. Initialize parameters. Each distribution has:
• Mean (center)
• Variance (spread)
• Mixing weight (the share of the data expected to come from that distribution)
3. Expectation step (E-step).
• For each point, calculate the probability that it belongs to each distribution.
• Points are not assigned hard to one cluster — instead, they get soft membership
(like 70% cluster A, 30% cluster B).
4. Maximization step (M-step).
• Update the parameters (mean, variance, weight) of each distribution based on the
probabilities from the E-step.
• The distributions shift to better fit the data.
5. Repeat until stable.
• Keep looping E-step → M-step.
• When parameters stop changing much, the distributions are finalized.
Types.
Expectation-Maximization (EM) Algorithm
An iterative algorithm used to estimate parameters of probabilistic models (like GMMs).
Working Steps:
1. E-step (Expectation):
• Calculate the probability of each point belonging to each cluster (soft assignment).
2. M-step (Maximization):
• Update each cluster's parameters (mean, variance, weight) using the membership
probabilities of all points.
3. Repeat E → M until convergence.
Key: EM alternates between guessing cluster memberships and refining parameters until the model
fits the data well.
Gaussian Mixture Models (GMMs)
It’s a probabilistic clustering method where data is assumed to come from a mixture of Gaussian
(Normal) distributions. Each cluster is described by its mean (center), variance (spread), and
weight (its share of the points).
Working:
1. Assume Gaussian distributions (mixture distribution).
2. Initialize parameters (mean, variance, weight), for example with random starting values.
3. Use the Expectation step (compute each point's membership probability for every cluster)
and the Maximization step (update each cluster's parameters using those probabilities as
weights) to refine the parameters.
4. Repeat until parameters stabilize.
Key: Handles overlapping clusters better than K-Means by using probabilities instead of hard
assignments.
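The E-step and M-step can be made concrete with a small 1-D Gaussian mixture. This is a sketch under simplifying assumptions: one-dimensional data, a fixed number of iterations instead of a convergence test, and hand-picked starting values. The names normal_pdf and em_gmm_1d are illustrative.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iters=50):
    """Fit a 1-D Gaussian mixture with EM; the parameter lists are updated in place."""
    k = len(means)
    resp = []
    for _ in range(iters):
        # E-step: probability that each point came from each component.
        resp = []
        for x in data:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters, using the probabilities as soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor keeps a component from collapsing to zero width
            weights[j] = nj / len(data)
    return means, variances, weights, resp

data = [1.0, 1.3, 0.7, 1.1, 0.9, 5.0, 5.4, 4.6, 5.1, 4.9]
means, variances, weights, resp = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Note that no point is ever moved wholesale into one cluster; the hard label, if needed, is read off at the end as the component with the highest probability.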
Dirichlet Process Mixture Models (DPMMs)
A Bayesian non-parametric clustering method that extends GMMs. Instead of fixing the number of
clusters k, DPMMs allow the model to adaptively decide how many clusters are needed.
Working:
1. Assume data comes from an infinite mixture of distributions.
2. Check similarity: When a new data point arrives, the model checks if it fits well with an
existing cluster (distribution).
• Join if similar: If the point is similar enough, it’s added to that cluster.
• Create new if different: If it doesn’t fit, then a new cluster is created for that new point.
3. As data is processed, clusters are formed dynamically.
4. Inference is done using methods like Gibbs sampling or variational inference.
Key: Unlike GMMs, you don’t need to predefine k. The model infers the number of clusters
from the data.
• Hierarchical Clustering (Connectivity Based Method)
Clusters are formed by building a tree-like structure (dendrogram). It groups data step by step,
either by merging smaller clusters into bigger ones (agglomerative) or by splitting bigger clusters
into smaller ones (divisive).
Working.
1. Treat each point as its own cluster. Start with n clusters (one per data point).
2. Find closest clusters. Measure distance between clusters using a point distance (Euclidean
or Manhattan) combined with a linkage rule (single, complete, or average).
3. Merge clusters. Combine the two closest clusters into one.
4. Repeat merging. Keep merging step by step until all points are in a single cluster.
5. Build dendrogram. The process creates a tree diagram showing how clusters merge at each
step.
6. Cut the tree. Decide the number of clusters by “cutting” the dendrogram at a chosen level.
Types.
Agglomerative Clustering (Bottom-Up)
A type of hierarchical clustering where each point starts as its own cluster, and clusters are merged
step by step.
Working:
1. Treat each point as a cluster.
2. Find the two closest clusters (using linkage: single, complete, average).
3. Merge them into one cluster.
4. Repeat until all points are merged into one big cluster.
5. Cut the dendrogram to get desired clusters.
Key: Most common hierarchical method; builds clusters from the bottom up.
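A small sketch of the bottom-up procedure, assuming single linkage and Euclidean distance. Stopping when n_clusters remain has the same effect as cutting the dendrogram at that level. The function name agglomerative and the toy points are illustrative assumptions.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns (clusters of point indices, merge log)."""
    clusters = [[i] for i in range(len(points))]  # step 1: every point is its own cluster
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of points across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Step 2: find the two closest clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Step 3: merge them and record the merge (this log is the dendrogram's content).
        merges.append((clusters[a], clusters[b], dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```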
Divisive Clustering (Top-Down)
Opposite of agglomerative — start with all points in one cluster and split step by step.
Working:
1. Treat all points as one cluster.
2. Split the cluster into smaller groups based on dissimilarity.
3. Keep splitting until each point is its own cluster.
4. The dendrogram shows the splits.
5. Cut the tree to get desired clusters.
Key: Less common; builds clusters from the top down.
Affinity Propagation
This method identifies representative points (exemplars) and assigns the other points to them.
Working:
1. All points start as potential leaders. Points share scores with each other, and the points with
higher scores become representative points (leaders).
2. Points exchange “messages” about how suitable they are to be representative points.
• Responsibility: How well suited a candidate is to serve as the representative point for a given point.
• Availability: How appropriate it is for another point to choose that representative
point.
3. Iteratively update these messages.
4. Converge to a set of representative points and clusters around them.
Key:
• Doesn’t require predefining k.
• Automatically finds the number of clusters.
• Best suited to small and medium datasets, because time and memory grow with the square of
the number of points.
• Density-based / DBSCAN.
Data points are grouped based on density of points in a region. Clusters are formed where points are
closely packed together, and areas with low density are treated as noise or outliers.
Working
1. Set parameters.
• ε (epsilon): the radius of the neighborhood.
• minPts: the minimum number of points required to form a dense region.
2. Pick a point.
• If that point has at least minPts neighboring points within radius ε, it becomes a
core point.
• Otherwise, it is a border point (if it lies within ε of a core point) or noise.
3. Form a cluster.
• All neighboring points within ε of the core point join its cluster, together with the core point.
• If those neighbors are also core points, their neighbors are added too. 👉 This
expands the cluster outward.
4. Repeat for all points.
• Continue until all points are either assigned to a cluster or marked as noise.
5. Result.
• Dense regions become clusters.
• Sparse regions remain empty or are labeled as noise.
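The procedure above can be sketched in plain Python as follows. The sketch assumes Euclidean distance and counts a point as its own neighbor, which is a common convention; the names dbscan and NOISE and the toy points are illustrative.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabeled later if it borders a cluster
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point, so the cluster grows through it
                queue.extend(j_nbrs)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```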
Reinforcement learning.
• Model-Based Methods.
These methods use a model of the environment to predict outcomes. The agent plans by
simulating actions and estimating their results, then chooses the policy that maximizes the
reward.
Markov Decision Processes (MDPs)
It is a mathematical framework to model decision-making in uncertain environments where
outcomes are partly random and partly under the agent’s control (e.g., the agent chooses to throw
a die as its action, but the outcome of the throw is random).
Components:
• States (S): Situation the agent can be in.
• Actions (A): Choices available to the agent.
• Transition probabilities (P): Probability of moving from one state to another given an
action.
• Rewards (R): A reward is a number given to the agent after it takes an action. A positive
reward means the result of the action was good; a negative reward means the outcome was
bad or carried a penalty. Rewards help the agent learn what is good or bad.
• Policy (π): A policy is the agent’s plan. The goal is to find the best policy that helps the
agent earn the highest total reward over time.
Goal: Find a policy that maximizes expected cumulative reward.
Bellman Equation
It expresses the value of a state as the immediate reward plus the value of the states that follow.
By comparing the current reward plus the expected future reward for each action, the agent can
tell which action leads to the maximum total reward.
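Written out, one standard form of the Bellman optimality equation is given below. The notation is an assumption layered on the MDP components listed earlier: V*(s) is the best achievable value of state s, and γ (a discount factor between 0 and 1 that shrinks future rewards) does not appear in these notes.

```latex
V^{*}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^{*}(s') \bigr]
```

Read left to right: for each action a, average over the possible next states s' the immediate reward plus the discounted value of s', then take the action with the largest total.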
Value Iteration
Value Iteration is an algorithm that finds the best policy in a Markov Decision Process (MDP) by
repeatedly updating the value of each state (revising its estimate based on new information)
using the Bellman equation until the values stop changing.
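A minimal sketch of value iteration on a made-up two-state MDP. The transition table format, the discount factor gamma = 0.9, the tolerance, and the state and action names are all assumptions chosen for the example.

```python
def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6):
    """transitions[(state, action)] is a list of (probability, next_state, reward)."""

    def action_value(s, a, values):
        return sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[(s, a)])

    values = {s: 0.0 for s in states}  # initial guesses
    while True:
        delta = 0.0
        for s in states:
            options = [action_value(s, a, values) for a in actions if (s, a) in transitions]
            new_value = max(options) if options else 0.0  # terminal states keep value 0
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:  # values have stopped changing
            break
    # Read off the policy: in each state, take the action with the highest value.
    policy = {}
    for s in states:
        available = [a for a in actions if (s, a) in transitions]
        if available:
            policy[s] = max(available, key=lambda a: action_value(s, a, values))
    return values, policy

states = ["start", "goal"]
actions = ["safe", "risky"]
transitions = {
    ("start", "safe"): [(1.0, "goal", 1.0)],
    ("start", "risky"): [(0.5, "goal", 4.0), (0.5, "start", 0.0)],
}
values, policy = value_iteration(states, actions, transitions)
print(round(values["start"], 3), policy["start"])  # 3.636 risky
```

Here the risky action satisfies V = 0.5 * 4 + 0.5 * 0.9 * V, so V = 2 / 0.55, which beats the safe reward of 1.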
Monte Carlo Tree Search (MCTS)
A search algorithm that uses random simulations to try actions and estimate their results. It is
used for problems with very large decision spaces. It builds a search tree step by step, using
the simulation results to choose better actions.
Steps:
1. Selection: Traverse the tree from root to leaf using a selection policy. The most common
rule is UCT which balances:
• Exploitation: Choosing actions that have produced higher rewards in past simulations.
• Exploration: Trying actions that have been tried rarely or not at all.
2. Expansion: When the selection phase reaches a leaf node that is not a terminal state, the
algorithm expands the tree by adding one or more child nodes representing possible actions
from that state.
3. Simulation (Rollout): Play out randomly until a terminal state is reached.
4. Backpropagation: Update values of nodes along the path with simulation results.
Key Features:
• Balances exploration vs exploitation.
• Doesn’t require full knowledge of transition probabilities.
• Effective in large search spaces (like Go, Chess).
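The UCT rule used in the selection step can be sketched as a scoring function. The formula shown is the commonly used UCB1 form with exploration constant c = √2; the constant, the function name uct_score, and the example statistics are assumptions for illustration.

```python
import math

def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Average reward (exploitation) plus a bonus for rarely visited children (exploration)."""
    if visits == 0:
        return math.inf  # untried actions are always selected first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

# Hypothetical child statistics: action -> (total reward, visit count).
children = {"a": (6.0, 10), "b": (1.0, 1), "c": (0.0, 0)}
parent_visits = sum(visits for _, visits in children.values())

best = max(children, key=lambda name: uct_score(*children[name], parent_visits))
print(best)  # c, because it has never been tried
```

Among the tried actions, "b" outscores "a" here even though its sample is tiny: a single visit earns a large exploration bonus.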
Semi supervised learning.
It uses a mix of labeled and unlabeled data, and is useful when labeling is costly or labeled data is very limited.
Semi-Supervised Classification
A learning approach that uses a small amount of labeled data together with a large amount of
unlabeled data to train a classifier.
Why:
• Labeling is expensive and time-consuming.
• Unlabeled data is abundant.
• Semi-supervised methods leverage both to improve accuracy.
How it works:
1. Train an initial model on the small labeled dataset.
2. Use the model to infer labels (pseudo-labels) for unlabeled data.
3. Retrain the model using both labeled and pseudo-labeled data.
4. Iterate until performance stabilizes.
Example:
• Spam detection: You have 100 labeled emails (spam/ham) and 10,000 unlabeled emails.
Semi-supervised learning uses both to build a stronger classifier.
Self-Training
A simple semi-supervised method where the model teaches itself by labeling unlabeled data.
Steps:
1. Train a classifier on the labeled data.
2. Use this trained classifier to predict labels for unlabeled data.
3. Select the most confident pseudo-labeled predictions (high probability).
4. Add those pseudo-labeled examples to the training set alongside the labeled data.
5. Retrain the classifier.
6. Repeat until no improvement.
Key Idea: The model “bootstraps” itself — gradually expanding its training set with pseudo-labels.
Risk: If early predictions are wrong, errors can propagate.
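The self-training loop can be sketched with a deliberately tiny stand-in classifier: one mean per class on 1-D data, with a distance-based confidence score. The classifier, the 0.8 threshold, and the numbers are assumptions for illustration; any model that reports a confidence could take its place.

```python
def fit_means(xs, ys):
    """Stand-in classifier: remember the mean feature value of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict(means, x):
    """Return (label, confidence); confidence lies in [0.5, 1] when there are two classes."""
    dists = {c: abs(x - m) for c, m in means.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence

def self_train(labeled_x, labeled_y, unlabeled, threshold=0.8, max_rounds=10):
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled)
    for _ in range(max_rounds):
        means = fit_means(xs, ys)                       # steps 1 and 5: (re)train
        scored = [(x, *predict(means, x)) for x in pool]  # step 2: predict pseudo-labels
        accepted = [(x, lab) for x, lab, conf in scored if conf >= threshold]  # step 3
        if not accepted:
            break                                       # step 6: nothing confident left to add
        for x, lab in accepted:                         # step 4: grow the training set
            xs.append(x)
            ys.append(lab)
        accepted_x = {x for x, _ in accepted}
        pool = [x for x in pool if x not in accepted_x]
    return fit_means(xs, ys), pool

means, leftover = self_train([1.0, 9.0], ["ham", "spam"], [1.5, 2.0, 8.0, 8.5, 5.2])
print(means["ham"], means["spam"], leftover)  # 1.5 8.5 [5.2]
```

The ambiguous value 5.2 never reaches the confidence threshold, so it stays unlabeled instead of risking a wrong pseudo-label.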
Few-Shot Learning
Few-Shot Learning (FSL) is a machine learning approach where models learn to perform new tasks
or recognize new classes using only a few labeled samples per class.
Instead of training a huge classifier, the model learns to compare new samples (queries) against
the small labeled set (support) and generalize.
How It Works
1. Embed samples: Convert both support and query examples into feature vectors using a
neural network encoder.
2. Compare: Measure similarity between query and support embeddings using functions like
cosine similarity or Euclidean distance.
3. Predict: Assign the query the label of the closest support example or prototype
(average embedding per class).
4. Adapt: Some methods fine-tune parameters on the support set for better performance on
queries.
Example
• Support Set: 1 image of a cat, 1 image of a dog.
• Query: A new image of a dog.
• Model encodes the query, compares it to support embeddings, finds it closest to “dog,” and
predicts “Dog.”
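The compare-and-predict steps can be sketched with prototypes and cosine similarity. The embedding vectors below are made-up numbers standing in for the output of an encoder network, which is not shown; the function names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prototypes(support):
    """support maps label -> list of embeddings; returns label -> average embedding."""
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in support.items()
    }

def classify(query, protos):
    """Pick the label whose prototype is most similar to the query embedding."""
    return max(protos, key=lambda label: cosine(query, protos[label]))

# Hypothetical embeddings: one support example per class (a 2-way, 1-shot task).
support = {"cat": [[0.9, 0.1, 0.2]], "dog": [[0.1, 0.8, 0.3]]}
query = [0.2, 0.9, 0.25]
prediction = classify(query, prototypes(support))
print(prediction)  # dog
```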
Why important:
• In real life, you often don’t have thousands of examples.
• Few-shot learning aims to mimic human ability to learn from just a handful of samples.
Techniques:
• Meta-Learning (“learning to learn”): Train models on many tasks so they can quickly
adapt to new ones.
• Metric-based methods: Learn embeddings so that new examples can be classified by
similarity (e.g., Siamese networks, prototypical networks).
• Data augmentation: Generate synthetic examples to expand the few samples.
Forecast Models.
ARIMA (Auto Regressive Integrated Moving Average)
ARIMA is a forecasting model for time series data, meaning numbers recorded in sequence over
time that rise and fall, like 1, 3, 6, 10, 2, 0…
It learns from past values and past forecast errors to predict future values.
It combines three components:
• AR (Auto Regression): Dependence on past values.
• I (Integrated): Differencing the series (subtracting the previous value from each value) to remove trends and make it stationary.
• MA (Moving Average): Learns from past forecast errors.
Working Principles
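1. Stationarity check: Ensure the time series is stationary (mean and variance are constant over
time). If it is not, difference it; the number of differences needed is the differencing order d.
2. Parameter estimation: Use ACF and PACF plots to choose the moving-average order q and the
autoregressive order p.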
3. Model fitting: Estimate coefficients using maximum likelihood to minimize forecast
errors, then fit the ARIMA model to the data.
4. Forecasting: Use fitted ARIMA model to predict future values.
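To make the differencing and autoregressive ideas concrete, here is a toy ARIMA(1,1,0)-style forecast in plain Python. It is a simplification: the coefficient is estimated by ordinary least squares rather than maximum likelihood, there is no moving-average term, and the series is invented so the arithmetic is easy to follow. All function names are illustrative.

```python
def difference(series, lag=1):
    """The "I" step: replace each value with its change from `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

def fit_ar1(series):
    """Least-squares estimate of (c, phi) in x_t = c + phi * x_{t-1} + error."""
    x, y = series[:-1], series[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    phi = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return mean_y - phi * mean_x, phi

def forecast(series, steps):
    diffs = difference(series)   # difference once (d = 1)
    c, phi = fit_ar1(diffs)      # AR(1) on the differenced series (p = 1)
    value, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # predict the next change
        value += last_diff               # undo the differencing
        predictions.append(value)
    return predictions

series = [10, 12, 15, 17, 20, 22, 25, 27, 30]  # changes alternate +2, +3
predictions = forecast(series, steps=3)
print([round(p, 2) for p in predictions])  # [32.0, 35.0, 37.0]
```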
SARIMA (Seasonal Auto Regressive Integrated Moving Average)
SARIMA extends ARIMA by adding seasonal components to handle repeating patterns in time
series (e.g., monthly sales spikes, yearly demand cycles). It captures both trend and seasonality.
Components
SARIMA is written as SARIMA(p, d, q)(P, D, Q)m:
• Non seasonal part (p,d,q): Same as ARIMA.
• p : autoregressive terms
• d : differencing order
• q : moving average terms
• Seasonal part (P,D,Q): Captures repeating cycles.
• P : seasonal autoregressive terms
• D : seasonal differencing order
• Q : seasonal moving average terms
• m : length of seasonal cycle (e.g., 12 for monthly data, 4 for quarterly data)
Working Principles
1. Stationarity check: Apply both regular and seasonal differencing if needed.
2. Parameter estimation: Use ACF/PACF plots to identify seasonal and non seasonal orders.
3. Model fitting: Estimate coefficients using maximum likelihood to minimize forecast errors.
4. Forecasting: Predict future values while accounting for both trend and seasonal cycles.
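Seasonal differencing, the D part of the model, subtracts the value from one full cycle earlier. The sketch below uses invented quarterly data (m = 4) with a fixed seasonal shape and steady year-over-year growth; the function name is illustrative.

```python
def seasonal_difference(series, m):
    """Subtract the value from one seasonal cycle (m steps) earlier."""
    return [series[t] - series[t - m] for t in range(m, len(series))]

# Three years of quarterly values: same seasonal shape each year, rising by 2 per year.
quarterly = [10, 20, 15, 40, 12, 22, 17, 42, 14, 24, 19, 44]

deseasonalized = seasonal_difference(quarterly, m=4)
print(deseasonalized)  # [2, 2, 2, 2, 2, 2, 2, 2]

# A regular (lag-1) difference then removes the remaining constant growth.
stationary = seasonal_difference(deseasonalized, m=1)
print(stationary)  # [0, 0, 0, 0, 0, 0, 0]
```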
Difference in One Line
• ARIMA: Remembers changes in the past to forecast future values.
• SARIMA: Remembers changes in the past plus captures repeating cycles (seasonal
patterns).
Quick Example
• ARIMA: Stock prices → trend + random ups/downs.
• SARIMA: Monthly sales → trend + December holiday spike every year.
👉 In short:
ARIMA = past changes.
SARIMA = past changes + repeating seasonal cycle.
Exponential Smoothing for Time Series Forecasting
Exponential smoothing is a forecasting method that gives more weight to recent observations
while older data gets exponentially less weight. It assumes the future will resemble the recent past
and can be extended to handle trend and seasonality.
Types of Exponential Smoothing
1. Simple Exponential Smoothing (SES)
• Used when data has no trend or seasonality.
• Uses one parameter: α (alpha), the smoothing factor (0 ≤ α ≤ 1).
• Small α → smoother forecasts (more weight on past).
• Large α → more responsive to recent changes.
2. Double Exponential Smoothing (Holt’s Method)
• Used when data has a trend but no seasonality.
• Adds a trend component with parameter β (beta).
• Captures both level and trend for better forecasts.
3. Triple Exponential Smoothing (Holt Winters Method)
• Used when data has trend + seasonality.
• Adds a seasonal component with parameter γ (gamma).
• Three parameters:
• α (level)
• β (trend)
• γ (seasonality)
• Two variants:
• Additive: the size of the seasonal swings stays roughly constant.
• Multiplicative: the size of the seasonal swings scales with the level of the series.
• Most flexible of the three methods but also the most complex.
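The first two methods are short enough to sketch directly. The initial values used here (first observation as the starting level, first change as the starting trend) are one common convention and an assumption of this sketch; the function names are illustrative.

```python
def simple_exp_smoothing(series, alpha):
    """SES: returns the final level, which is also the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level  # blend the new value with the old level
    return level

def holt_forecast(series, alpha, beta, steps):
    """Holt's method: smooth a level and a trend, then extrapolate the trend."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, steps + 1)]

print(simple_exp_smoothing([10, 20], alpha=0.5))                      # 15.0
print(holt_forecast([1, 2, 3, 4, 5], alpha=0.5, beta=0.5, steps=2))   # [6.0, 7.0]
```

With a large α the SES forecast would sit closer to the latest value (20); with a small α it would stay closer to the older value (10).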
Part 3: Deep Learning and Advanced ML Paradigms
Transfer Learning
A technique where a model trained on one task is reused for another, similar task. The pre-trained
model adapts to the differences through fine-tuning instead of being trained from scratch.
Key Idea: Leverage knowledge from a large dataset to improve performance on a separate, smaller
dataset.
Example: Using a model pre-trained to recognize dog images to detect cat images, since both share
facial features such as eyes, nose, and ears.
Deep Learning
Definition: A subset of machine learning that uses artificial neural networks with many layers to
automatically learn complex patterns.
Key Idea: Learns hierarchical features directly from raw data: low-level layers detect edges such
as horizontal and vertical lines, mid-level layers detect parts such as eyes and noses, and
high-level layers detect whole objects such as a car, an apple, or a cat.
Example: Speech recognition, image classification, natural language processing.
ML vs Deep Learning
Aspect | Machine Learning (ML) | Deep Learning (DL)
Definition | Algorithms learn from features engineered by humans | Neural networks learn features automatically
Data Need | Works with smaller datasets | Requires large datasets
Computation | Less computationally intensive | High computational power (GPUs)
Interpretability | Easier to interpret | Often a “black box”
Example | Decision trees, SVM | CNNs, RNNs, Transformers
👉 In short: ML = manual features, DL = automatic feature learning.
Generative AI
AI models that can generate new content (text, images, audio, code) based on training data.
Key Idea: Learns patterns and distributions of data, then produces novel outputs.
Examples of Models:
• Dall-E (Images from Text)
• Sora (Video from Text)
• Transformers (e.g., GPT)
Applications of Generative AI
• Text: Chatbots, content creation, summarization.
• Images: Art generation, design, medical imaging.
• Audio: Music composition, voice synthesis.
• Code: Automated programming assistance.
• Video: Synthetic video generation, animation.
Compiler Information
Definition: A compiler is a program that translates high-level source code into machine code that
a computer can execute.
Key Stages:
1. Lexical Analysis: Breaks code into tokens.
2. Syntax Analysis (Parsing): Checks grammar/structure.
3. Semantic Analysis: Validates meaning (types, scope).
4. Optimization: Improves efficiency of code.
5. Code Generation: Produces machine code.
6. Code Linking: Combines with libraries for execution.
Example: C/C++ source code compiled into executable machine code.
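The first stage, lexical analysis, is easy to illustrate. The sketch below tokenizes a tiny made-up expression language with regular expressions; the token names and patterns are assumptions for the example and do not describe any real compiler.

```python
import re

# (token name, regular expression) pairs for a tiny illustrative language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Break source text into (kind, text) tokens, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize("x = 3 + 4.5")
print(tokens)
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '4.5')]
```

The parser (stage 2) would then consume this token list instead of raw characters.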
Dimensionality Reduction (Advanced Unsupervised)
Used to reduce the number of features in a dataset while keeping key information to prevent
overfitting and speed up computation.
• PCA (Principal Component Analysis): A linear method that transforms correlated features
into a smaller set of uncorrelated "Principal Components". It keeps the components that
capture the most variance and drops the low-variance ones.
• ICA (Independent Component Analysis): Separates mixed signals into independent
sources (e.g., the Cocktail Party Problem: picking out one person’s voice from a noisy
room).
• Non-Linear Methods:
• Isomap: Preserves "geodesic distance" (the shortest path along a curved surface) to
unfold complex structures.
• LLE (Locally Linear Embedding): "Unrolls" curved data like a Swiss Roll by
preserving local relationships between neighboring points.
• t-SNE: Primarily used for visualization; it places similar high-dimensional data
points close together in a 2D or 3D plot (e.g., visualizing the MNIST digits dataset).
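As a small illustration of PCA, the sketch below finds the first principal component with power iteration on the covariance matrix and projects the data onto it. Power iteration is one simple way to get the top component and is an assumption of this sketch, not the only method; the function name and data are illustrative.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (direction, scores): the top variance direction and each row's 1-D projection."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)] for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the starting vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Two perfectly correlated features (y = 2x): one component carries all the variance.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
print([round(x, 3) for x in direction])  # [0.447, 0.894]
```

The two original features collapse to a single score per row with no information lost, because every point lies on the same line.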