
Dimensionality Reduction Techniques Explained

The document outlines various dimensionality reduction techniques, including Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which help simplify data while retaining essential information. It discusses the advantages and limitations of dimensionality reduction, such as improved computation speed and potential information loss. Additionally, it covers feature selection methods like the Chi-square test and Recursive Feature Elimination, emphasizing their roles in enhancing model performance and interpretability.

Uploaded by

baip1066

Computational Statistics

Computational Statistics - Endsem Exam Questions


[Link] Question Marks

Unit V: Statistical Processing (08 Hours):


Dimensionality Reduction Techniques: Principal Component Analysis, Discriminant Analysis. Feature Selection: Chi-square
method, Variance Threshold, Recursive Feature Elimination. Outlier detection methods. Resampling: random, under-sampling
and over-sampling. Case Studies: study about anomalies.
1 What is Dimensionality reduction? State advantages and limitations of dimension reduction 8


Dimensionality Reduction
Dimensionality reduction is a fundamental process in data analysis and machine learning used to reduce the number of
input variables or features (also called dimensions) in a dataset while retaining as much relevant information as possible.
This process simplifies data, making it more manageable for analysis, visualization, and machine learning model training.
Importance
 High-dimensional data challenges: Datasets with many features (sometimes hundreds or thousands) can cause:
o Increased computational complexity
o Overfitting of models
o Difficulty in visualizing patterns
o Poor model generalization
o Redundant or irrelevant features affecting accuracy
 Curse of dimensionality: As the number of dimensions increases, the data becomes sparse, and traditional
algorithms struggle to find meaningful patterns or relationships.
Goals
 Reduce complexity while maintaining the integrity of the data
 Improve model performance and generalization
 Prevent overfitting by eliminating irrelevant or noisy features
 Enhance visualization of high-dimensional data by reducing it to 2D or 3D
 Speed up training of machine learning models by working with fewer variables
Working
There are two main approaches:
1. Feature Selection
 Involves identifying and keeping only the most informative features.
 Discards features that add little to no value.
 Techniques include:
o Filter methods (e.g., correlation, variance threshold)
o Wrapper methods (e.g., recursive feature elimination)
o Embedded methods (e.g., LASSO regularization)
2. Feature Extraction
 Creates a new set of features by combining or transforming the original ones.
 The new features retain most of the important information.
 This often results in fewer but more meaningful variables.
 Techniques include:
o Principal Component Analysis (PCA) – unsupervised, maximizes variance
o Linear Discriminant Analysis (LDA) – supervised, maximizes class separability
For example, imagine a dataset with 50 columns describing a car (e.g., weight, speed, color, engine size). Dimensionality
reduction might reduce it to 5 key features (columns) that still capture the main patterns, like performance and size, without
losing the essence of the data.

Advantages of Dimensionality Reduction

Advantage | Description | Example
Faster Computation | Fewer features reduce the time and computational resources needed for training and inference. | A dataset with 1000 image pixels per sample is reduced to 50 key features, enabling a face recognition model to train 10× faster.
Reduced Overfitting | Simplifies models by removing irrelevant or redundant data, improving generalization. | In spam detection, discarding rare or unrelated words reduces the chance of fitting to noise, leading to more robust predictions.
Better Visualization | Helps represent high-dimensional data in 2D or 3D using basic methods like PCA for pattern recognition. | PCA applied to customer purchase records reveals distinct spending clusters when plotted in 2D.
Noise Elimination | Removes low-variance or irrelevant features that introduce inconsistency in modeling. | In gene expression data, PCA filters out random fluctuations and highlights genes with meaningful variance.
Storage Optimization | Fewer features mean smaller data files, which saves disk space and memory. | A machine log dataset with 1000 sensor readings is compressed to 100 informative ones, significantly reducing storage on servers.
Improved Accuracy | Emphasizing key variables often boosts model accuracy by focusing learning on relevant information. | A house price model using only 10 strong predictors out of 50 gives more accurate results than one trained on all features.

Limitations of Dimensionality Reduction

Limitation | Description | Example
Information Loss | Removing features risks discarding important patterns or correlations. | In a healthcare dataset, aggressive feature reduction may miss subtle symptoms linked to rare diseases.
Lack of Interpretability | Transformed features from techniques like PCA often lack meaningful names or units. | After PCA, a new component may combine "income" and "spending score," making it hard to explain to stakeholders.
Method Suitability | Not all reduction techniques work well for all data types (linear vs. non-linear). | PCA works best when the structure in the data is approximately linear; applying it to complex behavioral patterns might oversimplify insights.
Parameter Sensitivity | Some methods require tuning (e.g., number of components), which affects outcomes. | Choosing too few principal components can lead to underfitting; too many may retain noise.
Not Always Beneficial | In low-dimensional or well-structured datasets, reduction can remove useful data. | In a small dataset with 6 important features, reducing to 3 might eliminate essential variables, hurting model performance.
Initial Computational Cost | Some techniques, such as PCA on large datasets, require high memory and processing time initially. | Running PCA on a dataset with millions of samples and features may take hours and require powerful hardware.

2 Explain various methods of dimension reduction technique in detail 10


Dimensionality Reduction Techniques – Overview
Dimensionality reduction techniques transform high-dimensional data into a lower-dimensional form while preserving
important information. This helps in:
 Reducing computational cost
 Avoiding overfitting
 Improving visualization
 Enhancing model performance
These methods can be broadly categorized into:
 Linear vs. Non-linear Methods
 Supervised vs. Unsupervised Methods
1. Principal Component Analysis (PCA)
 Type: Linear, Unsupervised
 Description: Transforms data into new axes (principal components) that maximize variance. Retains most
important information while reducing features.
 Best for: Data compression, noise reduction, visualization.
2. Linear Discriminant Analysis (LDA)
 Type: Linear, Supervised
 Description: Projects data onto directions that maximize class separability. Uses class labels to find axes that best
discriminate between categories.
 Best for: Classification problems and feature extraction.
PCA Working
1. Standardize the Data:
Normalize all features to have zero mean and unit variance to avoid bias due to scale differences (e.g., kg vs. cm).
2. Compute the Covariance Matrix:
Measures how features vary together. It highlights correlated variables that can be combined.
3. Calculate Eigenvalues and Eigenvectors:
o Eigenvectors: The directions (principal components) capturing maximum variance.
o Eigenvalues: Indicate the magnitude of variance captured by each component.
4. Select Top-k Components:
Retain the k components with the largest eigenvalues, choosing k so that together they explain a target share of the total variance (e.g., 95%).
5. Project the Data:
Transform the original data onto the reduced principal component space.
Mathematical Foundation
Given a standardized data matrix $X$ of size $n \times d$ ($n$ samples, $d$ features):
Covariance Matrix:
$$C = \frac{1}{n-1} X^{\top} X$$
Eigen Decomposition or SVD:
$$C = V \Lambda V^{\top}$$
Where V contains eigenvectors and Λ contains eigenvalues.
Projection onto Top-k Components:
$$Z = X W$$
Where W is a matrix of the top-k eigenvectors.
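The PCA steps above (standardize, covariance matrix, leading eigenvector, projection) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the car numbers are hypothetical, and power iteration is swapped in for a full eigen decomposition, so only the first principal component is recovered.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (sample) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(X):
    """C = X^T X / (n - 1) for data that is already centered."""
    n, d = len(X), len(X[0])
    return [[sum(r[a] * r[b] for r in X) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_component(C, iters=500):
    """Power iteration: largest eigenvalue of C and its unit eigenvector."""
    d = len(C)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-symmetric start
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[a] * sum(C[a][b] * v[b] for b in range(d)) for a in range(d))
    return lam, v

# Hypothetical car data: weight (kg), horsepower, fuel efficiency (mpg)
cars = [[1000, 70, 40], [1200, 90, 35], [1500, 120, 28],
        [1800, 160, 22], [2100, 200, 18]]

X = standardize(cars)
C = covariance(X)
eigenvalue, w1 = top_component(C)

# Share of total variance explained by the first component
explained_ratio = eigenvalue / sum(C[j][j] for j in range(len(C)))

# Project each car onto the first principal component (Z = X W with k = 1)
scores = [sum(x * w for x, w in zip(row, w1)) for row in X]
```

Because the three features are strongly correlated, one component carries almost all of the variance here. Weight and horsepower load with the same sign and fuel efficiency with the opposite sign, which matches the "vehicle performance" reading used in these notes.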

Use Cases
 Image Compression: Reduce pixels/features while retaining image quality.
 Noise Reduction: Remove minor fluctuations in biological or sensor data.
 Preprocessing: Improve model performance by eliminating irrelevant variables.
 Exploratory Analysis: Discover hidden structure in data.

Strengths
 Computationally Efficient
 Preserves Global Variance
 Handles Multicollinearity
 Unsupervised – no labels required
Limitations
 Assumes Linearity
 Sensitive to Outliers
 Difficult to Interpret Components
 Requires Feature Scaling
Example
For a car dataset with features like weight, horsepower, and fuel efficiency, PCA may combine them into one or two
principal components (e.g., "vehicle performance") that capture 95% of the total variation.
LDA is a supervised dimensionality reduction technique used when the dataset contains class labels. It finds linear
combinations of features (called discriminant axes) that maximize class separation while minimizing variance within
classes.
PCA focuses on variance; LDA focuses on class discrimination.
LDA Working
1. Compute Class Means:
For each class, compute the mean feature values.
2. Calculate Scatter Matrices:
o Within-Class Scatter SW: Spread of data within each class.
o Between-Class Scatter SB: Distance between the class means.
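3. Find Linear Discriminants:
Maximize the ratio:
$$J(W) = \frac{\left| W^{\top} S_B W \right|}{\left| W^{\top} S_W W \right|}$$
Solve Eigenvalue Problem:
$$S_W^{-1} S_B \, w = \lambda \, w$$
Project Data:
$$Y = X W$$
Where W contains the discriminant directions.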
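For the two-class case, the eigenvalue problem above has a closed-form solution: the single discriminant direction is proportional to S_W⁻¹(m₁ − m₀). The sketch below uses that special case with two features so the 2×2 inverse can be written by hand. The data points are hypothetical and the function names are chosen only for illustration.

```python
import math

def mean_vec(rows):
    """Mean of each of the two features."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 scatter matrix: sum of outer products of deviations from m."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                S[a][b] += d[a] * d[b]
    return S

def fisher_direction(class0, class1):
    """Unit vector w proportional to S_W^{-1} (m1 - m0)."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    S0, S1 = scatter(class0, m0), scatter(class1, m1)
    Sw = [[S0[a][b] + S1[a][b] for b in range(2)] for a in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]

# Hypothetical two-feature samples for two classes
class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]]

w = fisher_direction(class0, class1)

# Project each sample onto the discriminant axis (y = x . w)
proj0 = [x[0] * w[0] + x[1] * w[1] for x in class0]
proj1 = [x[0] * w[0] + x[1] * w[1] for x in class1]
```

With two classes there is at most c − 1 = 1 discriminant, so the data collapses to a single axis on which the two groups no longer overlap.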


Key Notes
 Maximum number of discriminants = c−1 where c is the number of classes.
 LDA emphasizes inter-class separability over general variance.
Use Cases
 Face Recognition
 Medical Diagnosis
 Text Classification
 Marketing Segmentation
Strengths
 Maximizes Class Separability
 Supervised – uses label information
 Improves Classification Accuracy
Limitations
 Requires Labeled Data
 Can Only Reduce to c−1 Dimensions
 Assumes Gaussian Distribution

 Sensitive to Class Imbalance
Example
A wine classification dataset with features like acidity, alcohol, and sugar is reduced from 10 to 2 LDA dimensions that
best separate wine types (e.g., red vs. white vs. rosé).
PCA vs. LDA – Comparison Table

Aspect | PCA | LDA
Type | Unsupervised (no class labels used) | Supervised (classification problems)
Objective | Maximize total variance | Maximize class separability
Requires Labels? | No | Yes
Output Dimensions | Up to number of original features | Up to c−1, where c = number of classes
Assumptions | Linear relationships | Linear class boundaries; Gaussian class distributions
Feature Transformation | Orthogonal basis of maximum variance | Optimal class separation directions
Best For | Compression, noise reduction | Classification, class-focused feature extraction
Example | Image compression | Face or speech recognition

Practical Tips for Both

Consideration | PCA | LDA
Standardization | Required | Required
Outlier Sensitivity | High | Moderate
Visualization | Great for structure discovery | Great for labeled data visualization
Tool Support | [Link]. | sklearn.discriminant_analysis.
Dimensionality Limit | Can reduce to any number ≤ features | Can reduce to c−1 dimensions

3 Explain in detail the Chi-square Test for feature selection with the help of a suitable example. 7


Feature Selection and Chi-Square Test


Feature Selection: The process of selecting a subset of relevant features to improve machine learning model performance
by reducing dimensionality, eliminating irrelevant/redundant features, enhancing interpretability, reducing computational
cost, and mitigating overfitting.
The Chi-Square Test is a statistical hypothesis test used to determine if there is a significant association between two
categorical variables.
Context of feature selection:
 One variable is a feature (independent variable).
 The other is the target (dependent variable).
 The goal is to select the most relevant features that are significantly associated with the target.
When to use:
 The feature and the target must be categorical.
 Used primarily in classification problems (e.g., Yes/No, Spam/Not Spam, Purchase/No Purchase).
 It’s part of the filter method of feature selection.
Assumptions of Chi-Square Test
1. Both variables should be categorical.
2. Observations should be independent.
3. Sample size should be reasonably large.
4. Expected frequency in each cell of the contingency table should be at least 5.
Intuition
The test compares:
 Observed frequencies (O): How often categories actually occur.
 Expected frequencies (E): How often categories would occur if there was no association.
It then measures how far the observed counts are from the expected counts using:
Formula:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Step-by-Step Procedure:
Step 1: Prepare the Data
 Ensure features and target are categorical.
 If not, apply label encoding or binning.
Step 2: Create a Contingency Table
This is a matrix showing the frequency of feature categories vs target classes.
Example:

Gender Purchased = Yes Purchased = No Total

Male 3 2 5

Female 2 3 5

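Step 3: Compute Expected Frequencies
Use:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$
Step 4: Apply the Chi-Square Formula
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$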

Step 5: Determine Significance
 Compute the degrees of freedom:
$$df = (r - 1)(c - 1)$$
r: Number of rows (categories of the feature).
c: Number of columns (categories of the target).
 Use a Chi-Square table to find the critical value at a significance level (e.g., 0.05).
 If χ² calculated > χ² critical, then the feature is significantly associated with the target.
Example:
Dataset:

Gender Purchase

Male Yes

Female No

Male No

Female Yes

Male Yes

Contingency Table:

Gender Yes No Total

Male 2 1 3

Female 1 1 2

Total 3 2 5

Expected Frequencies:

Gender Yes (E) No (E)

Male 3×3/5 = 1.8 3×2/5 = 1.2

Female 2×3/5 = 1.2 2×2/5 = 0.8

Degrees of freedom: df = (r−1)(c−1) = (2−1) × (2−1) = 1

Gender | Purchase | Observed (O) | Expected (E) | (O−E)²/E
Male | Yes | 2 | 1.8 | 0.0222
Male | No | 1 | 1.2 | 0.0333
Female | Yes | 1 | 1.2 | 0.0333
Female | No | 1 | 0.8 | 0.0500
χ² = Σ (O − E)² / E = 0.0222 + 0.0333 + 0.0333 + 0.0500 ≈ 0.1389
Critical value at α = 0.05 (df = 1): 3.841
Since 0.1389 < 3.841, we fail to reject the null hypothesis.
→ Gender is NOT significantly associated with Purchase.
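The Chi-Square Test is a simple and effective way to select categorical features that are statistically related to the target. It
compares observed vs expected frequencies to determine significance. In our example, Gender was not a good predictor of
Purchase. Note that every expected count in this toy example is below 5, so assumption 4 is not met; the numbers only
illustrate the calculation, and a real analysis would need a larger sample.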
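The worked example can be checked with a few lines of plain Python. This is a sketch of the calculation only; the function name `chi_square` is chosen for illustration, and the critical value 3.841 is taken from the example above rather than computed.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected table) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E_ij = (row i total * column j total) / grand total
    expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for o_row, e_row in zip(observed, expected)
                    for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Rows: Male, Female; columns: Purchase = Yes, Purchase = No
observed = [[2, 1],
            [1, 1]]

chi2_stat, df, expected = chi_square(observed)

CRITICAL_VALUE = 3.841  # alpha = 0.05, df = 1, from a chi-square table
significant = chi2_stat > CRITICAL_VALUE
```

For feature selection, the same function would be run once per categorical feature against the target, keeping the features whose statistic exceeds the critical value.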
4 Describe Recursive Feature Elimination with examples. 8


Recursive Feature Elimination (RFE) is a feature selection technique used in machine learning to identify the most
important features (or variables) for building a predictive model. It works by recursively removing the least important
features and building the model repeatedly until the desired number of features is reached.

Here, "recursively removing" means removing one (or more) features each time, and repeating this until only
the most important features remain.
Steps in RFE:
1. Train the model on the current set of features.
2. Rank the features by importance (e.g., coefficients in linear models or feature importance in tree models).
3. Remove the least important feature(s).
4. Repeat the process until the desired number of features is reached.
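Recursive Feature Elimination (RFE): Mathematical Steps
Let's take an example dataset with 5 features:
$$X = \{X_1, X_2, X_3, X_4, X_5\}$$
Suppose we use Logistic Regression as the estimator.
Train the Model
We train a Logistic Regression model:
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5)}}$$
The log-likelihood function is maximized to estimate the coefficients $\beta_0, \beta_1, \dots, \beta_5$.
Rank Features and Eliminate the Least Important
Rank the features by $|\beta_j|$ and eliminate the one with the smallest value (say $X_4$).
Remove $X_4$, and repeat model training with the remaining features:
$$X = \{X_1, X_2, X_3, X_5\}$$
Remove Next Least Important Feature
Train again:
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_5 X_5)}}$$
Calculate new coefficients → rank again → remove the next least important.
Repeat until the desired number of features (e.g., 2 or 3) remains.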
Repeat until the desired number of features (e.g., 2 or 3) remains.
Note on Other Models:
If we use Random Forest instead of Logistic Regression:
 Feature importance = decrease in Gini impurity or entropy.

Summary of steps:
Train model → estimate coefficients β
Rank features using ∣β∣ (or Gini importance, etc.)
Eliminate lowest one
Repeat with reduced feature set
Example: (answer accuracy not needed)
Sample X1 (Study Hours) X2 (Attendance %) X3 (Sleep Hours) X4 (Social Media Hours) y (Pass=1)
1 4 90 7 2 1
2 1 70 6 5 0
3 3 80 6.5 4 1
4 2 60 5 6 0
5 5 95 8 1 1
Step 1: Logistic Regression Model
Logistic regression model equation:
P(y=1 | X) = 1 / (1 + e^(-z)), where z = β₀ + β₁·x₁ + β₂·x₂ + β₃·x₃ + β₄·x₄
Assumed trained coefficients: β₀ = -12, β₁ = 1.5, β₂ = 0.1, β₃ = 0.2, β₄ = -0.05
Step 2: Predict for Sample 1
Sample 1: x₁=4, x₂=90, x₃=7, x₄=2
Substitute into z:
z = -12 + 1.5·4 + 0.1·90 + 0.2·7 - 0.05·2
z = -12 + 6 + 9 + 1.4 - 0.1 = 4.3
P(y=1|X) = 1 / (1 + e^(-4.3)) ≈ 0.987
Step 3: Feature Importance (First Iteration)
Feature importance (absolute coefficients):
Feature Coefficient β |β|
X1 1.5 1.5
X2 0.1 0.1
X3 0.2 0.2

X4 -0.05 0.05
Eliminate X4 (least important |β| = 0.05)
Step 4: Retrain Without X4
New coefficients: β₀ = -10, β₁ = 1.8, β₂ = 0.05, β₃ = 0.1
Importance:
Feature Coefficient β |β|

X1 1.8 1.8

X2 0.05 0.05

X3 0.1 0.1

Eliminate X2 (least important |β| = 0.05)


Step 5: Final Model with X1 and X3
Final coefficients: β₀ = -8, β₁ = 2.0, β₃ = 0.15
Sample 1: x₁ = 4, x₃ = 7
z = -8 + 2·4 + 0.15·7 = -8 + 8 + 1.05 = 1.05
P(y=1|X) = 1 / (1 + e^(-1.05)) ≈ 0.74
Final Selected Features
Features selected by RFE: X1 (Study Hours) and X3 (Sleep Hours)
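The elimination loop in this example can be sketched as below. To keep the sketch deterministic and dependency-free, model training is replaced by a hypothetical stub (`stub_fit`) that simply returns the coefficients assumed in Steps 1 and 4. In practice that stub would be a real estimator fit on the current feature subset.

```python
import math

def sigmoid(z):
    """Logistic function used by the model equation."""
    return 1.0 / (1.0 + math.exp(-z))

def rfe(features, fit, n_keep):
    """Repeatedly drop the feature with the smallest |coefficient|.

    fit(features) must return a dict mapping feature name -> coefficient.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        coefs = fit(remaining)
        weakest = min(remaining, key=lambda f: abs(coefs[f]))
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated

# Hypothetical stand-in for training: the coefficients assumed in the example
ASSUMED_COEFS = {
    ("X1", "X2", "X3", "X4"): {"X1": 1.5, "X2": 0.1, "X3": 0.2, "X4": -0.05},
    ("X1", "X2", "X3"): {"X1": 1.8, "X2": 0.05, "X3": 0.1},
}

def stub_fit(features):
    return ASSUMED_COEFS[tuple(features)]

selected, eliminated = rfe(["X1", "X2", "X3", "X4"], stub_fit, n_keep=2)

# Sample 1 predictions with the full model and with the final two-feature model
p_full = sigmoid(-12 + 1.5 * 4 + 0.1 * 90 + 0.2 * 7 - 0.05 * 2)
p_final = sigmoid(-8 + 2.0 * 4 + 0.15 * 7)
```

One caveat worth keeping in mind: ranking by |β| is only meaningful when features are on comparable scales, so a real run would usually standardize the inputs before fitting.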
Advantages of Recursive Feature Elimination (RFE)

No. | Advantage | Explanation
1. | Model-Based Selection | RFE selects features based on how important they are to a specific model (e.g., SVM, Logistic Regression and Random Forest). This often leads to better accuracy.
2. | Effective for High-Dimensional Data | Works well in domains like bioinformatics and text classification, where there are more features than samples.
3. | Recursive and Systematic | It removes the least important features in a step-by-step, greedy fashion, leading to a well-tuned subset.
4. | Integrates with Any Estimator | Can be used with any algorithm that provides feature importance (like coefficients or Gini).
5. | Can Improve Model Performance | Removing irrelevant features helps decrease overfitting and improves generalization.
6. | Gives Feature Ranking | Besides selection, RFE also provides a full ranking of all features based on importance.

Disadvantages of Recursive Feature Elimination (RFE)

No. | Disadvantage | Explanation
1. | Computationally Expensive | Training the model repeatedly for different subsets of features can be very slow, especially with large datasets.
2. | Greedy Approach May Miss Optimal Subset | It eliminates features one-by-one without backtracking, which can sometimes lead to suboptimal selection.
3. | Model-Dependent | Results vary depending on the estimator (e.g., Logistic Regression may rank features differently than Random Forest).
4. | Sensitive to Correlated Features | RFE may keep one correlated feature and remove another, even though both together might improve performance.
5. | Requires Model That Supports Feature Importance | Cannot be used with models that don't expose feature ranking (e.g., k-NN, naïve Bayes without wrappers).
6. | Hard to Parallelize | Recursive process and dependency on intermediate steps make it difficult to parallelize for speedup.

5 Define Variance Thresholding and explain how it is used for robust feature selection. 6


Definition: Variance Thresholding


Variance Thresholding is a filter-based feature selection method that removes all features whose variance doesn’t
meet a specified threshold.
 It keeps only those features that vary enough across the samples.
 Features with low variance are considered uninformative (e.g., almost the same value in every row), and are
therefore removed.
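Mathematical Explanation
For a feature Xj, compute the variance:
$$\mathrm{Var}(X_j) = \frac{1}{n} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2$$
where n is the number of samples and $\bar{x}_j$ is the mean of feature Xj.
If:
Var (Xj) < Threshold
Then feature Xj is removed.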
Low-variance features:
 Do not help models distinguish between outputs.
 Add noise and increase model complexity.
 May cause overfitting or slow down training.
Thus, removing them makes feature selection robust and faster.
Illustrative Example (answer accuracy is not required)
Small Dataset (5 samples, 4 features):
Sample X1 X2 X3 X4

1 0 1 5 0

2 0 1 7 0

3 0 1 6 0

4 0 1 8 0

5 0 1 9 0

Compute Variance for each feature:


 Var (X1)= 0 (all values are 0)
 Var (X2)=0 (all values are 1)
 Var (X3) = 2.0 (mean = 7; squared deviations 4, 0, 1, 1, 4 sum to 10; 10/5 = 2.0)
 Var (X4)=0
Apply Threshold = 0.1
Only X3 has variance > 0.1 → So we keep X3, and remove X1, X2, X4.
Final Selected Feature: X3
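The same example can be reproduced with the standard library. This sketch assumes the population variance (divide by n), which is what gives Var(X3) = 2.0 here, and follows the rule stated above: a feature is removed when its variance is below the threshold.

```python
from statistics import pvariance

def variance_threshold(columns, threshold):
    """Keep features whose population variance is not below the threshold."""
    variances = {name: pvariance(values) for name, values in columns.items()}
    kept = [name for name, var in variances.items() if not var < threshold]
    return kept, variances

# The 5-sample, 4-feature dataset from the example, stored column by column
columns = {
    "X1": [0, 0, 0, 0, 0],
    "X2": [1, 1, 1, 1, 1],
    "X3": [5, 7, 6, 8, 9],
    "X4": [0, 0, 0, 0, 0],
}

kept, variances = variance_threshold(columns, threshold=0.1)
```

Variance depends on the units of each feature, so a single threshold is easiest to justify when the features share a scale.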
Advantages of Variance Thresholding

No. | Advantage | Explanation
1. | Simple to Understand and Implement | Easy to compute variance and apply a threshold. No complex algorithms involved.
2. | Unsupervised | Does not require the target variable (label). Useful for early data preprocessing.
3. | Fast and Scalable | Suitable for high-dimensional data. Just a single pass over each feature.
4. | Removes Uninformative Features | Eliminates constant or near-constant features that do not help in modeling.
5. | Reduces Noise | Helps reduce overfitting and improves model generalization by removing irrelevant features.
6. | Improves Speed | Speeds up training and prediction by reducing the number of input features.

Disadvantages of Variance Thresholding


No. | Disadvantage | Explanation
1. | Ignores Target Variable | It does not check whether a feature is actually related to the output class or regression target.
2. | May Remove Useful Features | A low-variance feature may still be important for classification (e.g., a binary indicator feature).
3. | Fixed Threshold is Arbitrary | Choosing a threshold (like 0.1) without domain knowledge can lead to poor results.
4. | Does Not Handle Categorical Data Directly | It works on numerical data only; categorical features must be encoded first.
5. | Does Not Detect Redundancy | It does not remove correlated features (e.g., duplicated information).

6 Define Outliers or Anomaly detection. Explain different methods to detect Anomaly 9


Outlier / Anomaly Detection

Definition
An outlier (or anomaly) is a data point that significantly differs from other observations. In machine learning and
statistics, outliers may indicate rare events, data entry errors, or fraud.
Examples:
Bank transaction of ₹10,00,000 from a student account
Sensor reading = 200°C when normal range is 20–80°C
LOF
Local Outlier Factor (LOF) compares the local density of a point to the densities of its neighbors.
If the point is in a less dense region than its neighbors, it is considered an outlier.
Idea:
 A point is an outlier if it has a lower density than its neighbors.
 LOF well above 1 → point is a potential anomaly
 LOF ≈ 1 → point is normal
Purpose of Anomaly Detection

- Fraud detection
- Fault detection in machines
- Network intrusion detection
- Medical diagnostics
- Data cleaning
Categories of Anomalies
1. Point Anomalies: Single unusual instance
2. Contextual Anomalies: Normal in one context, abnormal in another
3. Collective Anomalies: A group of points that are abnormal together
Methods to Detect Anomalies
1. Statistical Methods
a) Z-Score Method: z = (x - μ) / σ.
Mark x as anomaly if |z| > 3
b) IQR Method: IQR (Interquartile Range) = Q3 - Q1.
Outliers if x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
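Both statistical rules can be sketched with the standard library. The readings below are hypothetical sensor values, the Z-score uses the population standard deviation, and the quartiles follow the default convention of `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, limit=3.0):
    """Flag x when |z| = |x - mean| / std exceeds the limit."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > limit]

def iqr_outliers(values, k=1.5):
    """Flag x outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical temperature readings (°C) with one faulty value
readings = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
            49, 52, 48, 50, 51, 49, 50, 52, 48, 200]

z_flagged = zscore_outliers(readings)
iqr_flagged = iqr_outliers(readings)
```

With very small samples the Z-score rule can fail to flag even an extreme value, because that value inflates the mean and standard deviation it is measured against. The IQR rule is less affected by this.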

2. Distance-Based Methods:
Distance-based methods identify an observation as an anomaly if it lies far away from most other points in the dataset.
Idea:
Anomalies are the points that are not close to any other point.
Intuition
 Normal data points tend to form dense clusters.
 Outliers appear as isolated points with large distances from others.
Basic Principle
Let:
 x be a data point
 dist (x,y) be the distance between x and y
 D(x) be the average distance from x to its k nearest neighbors
Then:
 If D(x) > threshold, x is marked as an anomaly
a) k-Nearest Neighbors: Points far from k neighbors are anomalies

b) DBSCAN: Unclustered points are considered anomalies
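The k-nearest-neighbor rule, D(x) > threshold, can be sketched as follows. The points, the value k = 2, and the threshold of 3.0 are all hypothetical choices for illustration; in practice the threshold is usually set from the distribution of the scores.

```python
import math

def knn_scores(points, k):
    """D(x): average Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical 2-D data: a tight cluster plus one isolated point
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9), (8.0, 8.0)]

scores = knn_scores(points, k=2)
threshold = 3.0  # assumed cut-off for this toy data
anomalies = [p for p, s in zip(points, scores) if s > threshold]
```

This brute-force version compares every pair of points, which is fine for a sketch but scales quadratically with the number of samples.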


3. Density-Based Methods
Local Outlier Factor (LOF): Compares density around a point with its neighbors. Higher LOF value implies anomaly.
4. Model-Based Methods
a) Isolation Forest: Isolates anomalies quickly through random splits
b) One-Class SVM: Learns a boundary around normal data. Outliers lie outside this boundary.
5. Autoencoder-Based Methods
Train a neural network to reconstruct input. High reconstruction error → anomaly

Advantages of Using IQR for Outlier Detection
Benefit | Description
Robust to Outliers | Not affected by extreme values
Easy to Visualize | Commonly used in boxplots
Works for Non-Normal Data | No assumption of data distribution

Summary
Method | Type | Key Idea | Use Case
Z-Score/IQR | Statistical | Deviation from mean/median | Simple numerical data
k-NN | Distance-Based | Far from neighbors | General anomalies
LOF | Density-Based | Sparse density | Subtle anomalies
Isolation Forest | Model-Based | Easy to isolate | High-dimensional data
One-Class SVM | Model-Based | Boundary for normal data | Fraud/Rare events
Autoencoders | Deep Learning | High reconstruction error | Complex/sequential data
7 Explain in depth under-sampling, over-sampling, random sampling and random resampling 12


Sampling Techniques in Machine Learning


1. Under-Sampling
Under-sampling reduces the number of samples in the majority class to match the number in the minority class, creating a
balanced dataset.
Purpose:
To reduce classification bias toward the majority class.
Working:
- Randomly remove samples from the majority class.
- Keep all minority class samples.
Mathematical Representation:
Let M: number of majority class samples
Let m: number of minority class samples
If M > m, then select m samples from the majority class.
New dataset size: Total samples = 2 × m
Example:
Original: Class 0 = 1000, Class 1 = 100
After under-sampling: Class 0 = 100, Class 1 = 100
2. Over-Sampling
Over-sampling increases the minority class size by duplicating or synthetically generating data.
Purpose:
To balance classes without reducing majority class data.
Working:
- Duplicate existing minority class samples
- Or use SMOTE (Synthetic Minority Over-sampling Technique)
Let M: number of majority class samples
Let m: number of minority class samples
If M > m, then generate (M - m) new samples for the minority class.
New dataset size: Total samples = 2 × M
Example:
Original: Class 0 = 1000, Class 1 = 100
After over-sampling: Class 0 = 1000, Class 1 = 1000

3. Random Sampling
Randomly selecting a subset of the dataset, irrespective of class labels.
Types:
- With replacement: same sample can appear more than once
- Without replacement: samples are unique
Let N: total samples
Let k: desired number of samples (k < N)
Randomly pick k samples.
New dataset size: Total samples = k
Example:
From 10,000 samples, randomly select 2000.
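The two types of random sampling map directly onto two standard-library calls. This is a small sketch; the seed value and the integer "samples" are placeholders.

```python
import random

rng = random.Random(42)          # fixed seed so the draw is repeatable
population = list(range(10_000))  # N = 10,000 placeholder samples

# Without replacement: every selected sample is unique
without_replacement = rng.sample(population, k=2000)

# With replacement: the same sample may be drawn more than once
with_replacement = rng.choices(population, k=2000)

n_unique_with = len(set(with_replacement))
```

Neither call looks at class labels, which is why plain random sampling tends to preserve whatever class imbalance the original data had.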
4. Random Resampling
Randomly resample the dataset using either over-sampling or under-sampling to achieve class balance.
Let M > m: majority class > minority class
Then either:
- Over-sampling: Add (M - m) samples to minority
- Under-sampling: Remove (M - m) samples from majority
Example:
Original: Class 0 = 950, Class 1 = 50
Resampled: Class 0 = 950, Class 1 = 950 (over-sampling) or both = 50 (under-sampling)
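Random under-sampling and random over-sampling (by duplication) can be sketched as two small functions. The tuples standing in for samples and the function names are hypothetical; over-sampling here duplicates existing minority samples and does not generate synthetic ones.

```python
import random

def random_under_sample(majority, minority, rng):
    """Keep all minority samples and m randomly chosen majority samples."""
    return rng.sample(majority, k=len(minority)) + list(minority)

def random_over_sample(majority, minority, rng):
    """Keep everything and add (M - m) randomly duplicated minority samples."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra

rng = random.Random(0)
majority = [("class0", i) for i in range(950)]  # M = 950
minority = [("class1", i) for i in range(50)]   # m = 50

under = random_under_sample(majority, minority, rng)
over = random_over_sample(majority, minority, rng)

under_counts = (sum(1 for label, _ in under if label == "class0"),
                sum(1 for label, _ in under if label == "class1"))
over_counts = (sum(1 for label, _ in over if label == "class0"),
               sum(1 for label, _ in over if label == "class1"))
```

Resampling is normally applied to the training split only, so that the test set still reflects the real class distribution.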
Summary Table
Technique | Working | Pros | Cons
Under-Sampling | Reduces majority class size | Reduces training time | Loss of information
Over-Sampling | Increases minority class size | No data loss | Risk of overfitting
Random Sampling | Selects random subset | Simple and fast | May preserve imbalance
Random Resampling | Balancing by over/under sampling | Improves model balance | Overfitting or data loss
8 What is an imbalanced dataset? What are the different resampling techniques? Explain any one method in depth 7


Imbalanced Dataset and Resampling Techniques


An imbalanced dataset refers to a classification problem where the distribution of classes is not approximately equal. One
class (majority) has significantly more instances than the other (minority), leading to biased models that favor the majority
class.
Example:
In a fraud detection dataset:
- Class 0 (legitimate): 98,000 samples
- Class 1 (fraudulent): 2,000 samples
Resampling Techniques
Resampling helps balance the dataset by adjusting the number of samples in each class. There are three main types:
1. Under-Sampling
2. Over-Sampling
3. Hybrid Methods
1. Under-Sampling
Reduces the number of samples from the majority class to match the minority class.
Let M be the number of majority samples, and m be the number of minority samples. If M > m, randomly select m
samples from the majority class.
New dataset size = 2 × m
Example:
Original: Class 0 = 1000, Class 1 = 100
After under-sampling: Class 0 = 100, Class 1 = 100

2. Over-Sampling
Increases the number of samples in the minority class to match the majority class, either by duplication or synthetic
generation.
Let M be the number of majority samples, and m be the number of minority samples. If M > m, generate (M - m) new
samples for the minority class.
New dataset size = 2 × M
Example:
Original: Class 0 = 1000, Class 1 = 100
After over-sampling: Class 0 = 1000, Class 1 = 1000
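SMOTE (Synthetic Minority Over-sampling Technique) – explained in depth
SMOTE generates synthetic examples of the minority class by interpolating between an existing minority sample and one of
its k nearest minority-class neighbors:
$$x_{new} = x_i + \delta \, (x_{zi} - x_i), \quad \delta \in [0, 1]$$
where x_i is a minority sample, x_zi is one of its nearest minority neighbors, and δ is a random number between 0 and 1.

Example:
Assume:
x_i = [2, 3], x_zi = [4, 5], δ = 0.5
Then:
x_new = [2 + 0.5*(4-2), 3 + 0.5*(5-3)] = [3, 4]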
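The interpolation step, and a simplified version of the full sampling loop, can be sketched as below. The function names and the small minority set are hypothetical, and the neighbor search is a brute-force sort rather than anything optimized.

```python
import math
import random

def smote_point(x_i, x_zi, delta):
    """One synthetic sample on the segment between x_i and its neighbor x_zi."""
    return [a + delta * (b - a) for a, b in zip(x_i, x_zi)]

def smote(minority, n_new, k, rng):
    """Generate n_new synthetic minority samples by neighbor interpolation."""
    synthetic = []
    for _ in range(n_new):
        x_i = rng.choice(minority)
        # k nearest minority neighbors of x_i (excluding x_i itself)
        neighbors = sorted((p for p in minority if p is not x_i),
                           key=lambda p: math.dist(p, x_i))[:k]
        x_zi = rng.choice(neighbors)
        synthetic.append(smote_point(x_i, x_zi, rng.random()))
    return synthetic

# The worked example: x_i = [2, 3], x_zi = [4, 5], delta = 0.5
x_new = smote_point([2, 3], [4, 5], 0.5)

# A few synthetic points from a hypothetical minority class
minority = [[2, 3], [4, 5], [3, 6], [5, 4]]
generated = smote(minority, n_new=6, k=2, rng=random.Random(1))
```

Every synthetic point lies on a line segment between two real minority samples, so it always stays inside the region the minority class already occupies.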

Advantages:
- Reduces the overfitting that exact duplication of minority samples can cause
- Expands minority class feature space
Disadvantages:
- Risk of overlapping classes
- May amplify noise
Comparison Table
Technique | Description | Pros | Cons
Under-Sampling | Reduces majority class size | Fast, low memory | Loss of information
Over-Sampling | Duplicates or generates minority samples | No data loss | Risk of overfitting
SMOTE | Generates synthetic samples | Better generalization | Complex, may add noise
9 Explain a Test on Numerical Data- Distribution of a Sample Mean with examples. (Central Limit Theorem) – CLT 6


Understanding the distribution of a sample mean is essential in inferential statistics. This concept forms the basis of
hypothesis testing and confidence intervals when analyzing numerical data.
Definition:
The distribution of a sample mean refers to the probability distribution of means calculated from multiple random
samples of the same size (n) drawn from a population.
Let
X₁, X₂, ..., Xₙ
be independent and identically distributed (i.i.d.) random variables from a population with mean μ and variance σ².
Then the sample mean is:
𝑋̄ = (X₁ + X₂ + ... + Xₙ) / n

Properties:
1. Expected Value:
E(𝑋̄) = μ
(The expected value of the sample mean equals the population mean)
2. Variance:
Var(𝑋̄) = σ² / n
(The variance of the sample mean decreases with the sample size)
Central Limit Theorem (CLT):
For a population with finite variance, whatever the shape of its distribution, as n becomes large (a common rule of thumb
is n ≥ 30) the sampling distribution of the sample mean approaches a normal distribution:
𝑋̄ ~ N(μ, σ²/n) (approximately)
Example:
Suppose the population of weights of apples is normally distributed with:
Mean μ = 150 grams
Standard deviation σ = 30 grams
If we randomly select a sample of n = 25 apples, then the sample mean weight 𝑋̄ is distributed as follows (exactly, even
though n < 30, because the population itself is normal):
𝑋̄ ~ N(150, (30²)/25) = N(150, 36)
So, the standard deviation of the sample mean (also called the standard error) is:
SE = σ / √n = 30 / √25 = 6 grams
Now suppose we want to find the probability that the sample mean weight is less than 145 grams:
P(𝑋̄ < 145) = P(Z < (145 − 150)/6) = P(Z < −0.8333)
≈ 0.2023 (from standard normal table)
Thus, there is approximately a 20.23% chance that the sample mean weight of 25 apples is less than 145 grams.
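The apple calculation can be checked with `statistics.NormalDist`, and the CLT itself can be illustrated with a small simulation. The simulation part is an added illustration, not part of the worked example: it assumes an exponential population with mean 1 and standard deviation 1, samples of size 30, and an arbitrary seed.

```python
import random
from statistics import NormalDist, mean, stdev

# Worked example: population N(150, 30^2), sample size n = 25
mu, sigma, n = 150.0, 30.0, 25
se = sigma / n ** 0.5                            # standard error of the mean
p_below_145 = NormalDist(mu, se).cdf(145.0)      # P(sample mean < 145)

# Simulation: means of samples drawn from a skewed (exponential) population
rng = random.Random(0)
sample_size, n_samples = 30, 2000
sample_means = [mean(rng.expovariate(1.0) for _ in range(sample_size))
                for _ in range(n_samples)]

mean_of_means = mean(sample_means)        # should be close to mu = 1
spread_of_means = stdev(sample_means)     # should be close to 1 / sqrt(30)
```

Even though individual exponential draws are strongly skewed, the simulated sample means cluster around the population mean with a spread close to σ/√n, as the properties above predict.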
Summary Table:

Parameter | Formula | Description
Sample Mean (𝑋̄) | (ΣXᵢ)/n | Average of sample data
Expected Value | E(𝑋̄) = μ | Mean of the sampling distribution
Variance of Sample Mean | Var(𝑋̄) = σ² / n | Variability of sample mean
Standard Error (SE) | σ / √n | Standard deviation of the sample mean
CLT Distribution | N(μ, σ²/n) | Normal approximation (if n is large)
