Dimensionality Reduction Techniques Explained
Dimensionality Reduction
Dimensionality reduction is a fundamental process in data analysis and machine learning used to reduce the number of
input variables or features (also called dimensions) in a dataset while retaining as much relevant information as possible.
This process simplifies data, making it more manageable for analysis, visualization, and machine learning model training.
Importance
High-dimensional data challenges: Datasets with many features (sometimes hundreds or thousands) can cause:
o Increased computational complexity
o Overfitting of models
o Difficulty in visualizing patterns
o Poor model generalization
o Redundant or irrelevant features affecting accuracy
Curse of dimensionality: As the number of dimensions increases, the data becomes sparse, and traditional
algorithms struggle to find meaningful patterns or relationships.
Goals
Reduce complexity while maintaining the integrity of the data
Improve model performance and generalization
Prevent overfitting by eliminating irrelevant or noisy features
Enhance visualization of high-dimensional data by reducing it to 2D or 3D
Speed up training of machine learning models by working with fewer variables
Working
There are two main approaches:
1. Feature Selection
Involves identifying and keeping only the most informative features.
Discards features that add little to no value.
Techniques include:
o Filter methods (e.g., correlation, variance threshold)
o Wrapper methods (e.g., recursive feature elimination)
o Embedded methods (e.g., LASSO regularization)
2. Feature Extraction
Creates a new set of features by combining or transforming the original ones.
The new features retain most of the important information.
This often results in fewer but more meaningful variables.
Techniques include:
o Principal Component Analysis (PCA) – unsupervised, maximizes variance
o Linear Discriminant Analysis (LDA) – supervised, maximizes class separability
For example, imagine a dataset with 50 columns describing a car (e.g., weight, speed, color, engine size). Dimensionality
reduction might reduce it to 5 key features (columns) that still capture the main patterns, like performance and size, without
losing the essence of the data.
Advantages and Disadvantages of Dimensionality Reduction
Advantages
Faster Computation: Fewer features reduce the time and computational resources needed for training and inference. Example: A dataset with 1000 image pixels per sample is reduced to 50 key features, enabling a face recognition model to train 10× faster.
Improved Accuracy: Emphasizing key variables often boosts model accuracy by focusing learning on relevant information. Example: A house price model using only 10 strong predictors out of 50 gives more accurate results than one trained on all features.
Disadvantages
Information Loss: Removing features risks discarding important patterns or correlations. Example: In a healthcare dataset, aggressive feature reduction may miss subtle symptoms linked to rare diseases.
Parameter Sensitivity: Some methods require tuning (e.g., number of components), which affects outcomes. Example: Choosing too few principal components can lead to underfitting; too many may retain noise.
Initial Computational Cost: Some techniques, such as PCA on large datasets, require high memory and processing time initially. Example: Running PCA on a dataset with millions of samples and features may take hours and require powerful hardware.
Dimensionality Reduction Techniques – Overview
Dimensionality reduction techniques transform high-dimensional data into a lower-dimensional form while preserving
important information. This helps in:
Reducing computational cost
Avoiding overfitting
Improving visualization
Enhancing model performance
These methods can be broadly categorized into:
Linear vs. Non-linear Methods
Supervised vs. Unsupervised Methods
1. Principal Component Analysis (PCA)
Type: Linear, Unsupervised
Description: Transforms data into new axes (principal components) that maximize variance. Retains most
important information while reducing features.
Best for: Data compression, noise reduction, visualization.
2. Linear Discriminant Analysis (LDA)
Type: Linear, Supervised
Description: Projects data onto directions that maximize class separability. Uses class labels to find axes that best
discriminate between categories.
Best for: Classification problems and feature extraction.
PCA Working
1. Standardize the Data:
Normalize all features to have zero mean and unit variance to avoid bias due to scale differences (e.g., kg vs. cm).
2. Compute the Covariance Matrix:
Measures how features vary together. It highlights correlated variables that can be combined.
3. Calculate Eigenvalues and Eigenvectors:
o Eigenvectors: The directions (principal components) capturing maximum variance.
o Eigenvalues: Indicate the magnitude of variance captured by each component.
4. Select Top-k Components:
Retain the components that account for the most variance (e.g., 95%).
5. Project the Data:
Transform the original data onto the reduced principal component space.
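The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the toy data (weight in kg, horsepower) is hypothetical, and power iteration is used here as a simple stand-in for a full eigen-decomposition, so only the first principal component is recovered.

```python
import math

def standardize(rows):
    """Step 1: scale each column to zero mean and unit variance (n - 1 divisor)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Step 2: covariance matrix of already-centred data."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(c, iters=500):
    """Step 3 (simplified): power iteration for the largest eigenpair.

    Assumes the start vector is not orthogonal to the top eigenvector.
    """
    d = len(c)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(c[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Hypothetical toy data: (weight in kg, horsepower)
data = [[1000, 70], [1200, 90], [1500, 130], [1800, 160], [2000, 200]]

z = standardize(data)
c = covariance(z)
eigenvalue, pc1 = top_component(c)

# Step 4: share of total variance captured by the first component
explained = eigenvalue / sum(c[i][i] for i in range(len(c)))

# Step 5: project every sample onto the first principal component
projected = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in z]
```

Because the two toy features are strongly correlated, a single component captures most of the variance, which is the situation in which PCA compresses data well.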
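Mathematical Foundation
Given a standardized data matrix X of size n × d (n samples, d features):
Covariance Matrix: C = (1 / (n − 1)) · XᵀX
Eigen-decomposition: C·v = λ·v, where each eigenvector v is a principal component and its eigenvalue λ is the variance captured along it.
Projection: Z = X·Wk, where the columns of Wk are the k eigenvectors with the largest eigenvalues.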
Use Cases
Image Compression: Reduce pixels/features while retaining image quality.
Noise Reduction: Remove minor fluctuations in biological or sensor data.
Preprocessing: Improve model performance by eliminating irrelevant variables.
Exploratory Analysis: Discover hidden structure in data.
Strengths
Computationally Efficient
Preserves Global Variance
Handles Multicollinearity
Unsupervised – no labels required
Limitations
Assumes Linearity
Sensitive to Outliers
Difficult to Interpret Components
Requires Feature Scaling
Example
For a car dataset with features like weight, horsepower, and fuel efficiency, PCA may combine them into one or two
principal components (e.g., "vehicle performance") that capture 95% of the total variation.
LDA is a supervised dimensionality reduction technique used when the dataset contains class labels. It finds linear
combinations of features (called discriminant axes) that maximize class separation while minimizing variance within
classes.
PCA focuses on variance; LDA focuses on class discrimination.
LDA Working
1. Compute Class Means:
For each class, compute the mean feature values.
2. Calculate Scatter Matrices:
o Within-Class Scatter SW: Spread of data within each class.
o Between-Class Scatter SB: Distance between the class means.
3. Find Linear Discriminants:
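Maximize the ratio:
J(w) = (wᵀ·SB·w) / (wᵀ·SW·w)
The maximizing directions w are the leading eigenvectors of SW⁻¹·SB.
4. Project Data:
Project each sample x onto the chosen discriminants: y = Wᵀ·x.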
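For the two-class case, the ratio of between-class to within-class scatter is maximized by a single direction proportional to SW⁻¹·(m1 − m0), where m0 and m1 are the class means. The sketch below works through that special case on hypothetical 2-D data; the multi-class case needs a full eigen-decomposition and is not shown.

```python
def mean(rows):
    """Column-wise mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter(rows, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over the rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

# Hypothetical labelled data: two classes in two dimensions
class0 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
class1 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]

# Step 1: class means
m0, m1 = mean(class0), mean(class1)

# Step 2: within-class scatter SW = S0 + S1
s0, s1 = scatter(class0, m0), scatter(class1, m1)
sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]

# Step 3: w = SW^-1 (m1 - m0), using the closed-form 2x2 inverse
det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
sw_inv = [[sw[1][1] / det, -sw[0][1] / det],
          [-sw[1][0] / det, sw[0][0] / det]]
diff = [m1[0] - m0[0], m1[1] - m0[1]]
w = [sw_inv[0][0] * diff[0] + sw_inv[0][1] * diff[1],
     sw_inv[1][0] * diff[0] + sw_inv[1][1] * diff[1]]

# Step 4: project both classes onto the one discriminant axis
proj0 = [w[0] * x[0] + w[1] * x[1] for x in class0]
proj1 = [w[0] * x[0] + w[1] * x[1] for x in class1]
```

On this toy data the projected values of the two classes do not overlap, so a single threshold on the discriminant axis separates them.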
Sensitive to Class Imbalance
Example
A wine classification dataset with features like acidity, alcohol, and sugar is reduced from 10 to 2 LDA dimensions that
best separate wine types (e.g., red vs. white vs. rosé).
PCA vs. LDA – Comparison Table
Aspect | PCA | LDA
Feature Transformation | Orthogonal basis of maximum variance | Optimal class separation directions
Visualization | Great for structure discovery | Great for labeled data visualization
Dimensionality Limit | Can reduce to any number ≤ number of features | Can reduce to at most c − 1 dimensions (c = number of classes)
3 Explain in detail the Chi-square Test for feature selection with the help of a suitable example. 7
Observed frequencies (O): How often categories actually occur.
Expected frequencies (E): How often categories would occur if there were no association.
It then measures how far the observed frequencies are from the expected ones using:
Formula: χ² = Σ (O − E)² / E, summed over all cells of the contingency table.
Step-by-Step Procedure:
Step 1: Prepare the Data
Ensure features and target are categorical.
If not, apply label encoding or binning.
Step 2: Create a Contingency Table
This is a matrix showing the frequency of feature categories vs target classes.
Example:
Gender | Yes | No | Total
Male | 3 | 2 | 5
Female | 2 | 3 | 5
Gender | Purchase
Male | Yes
Female | No
Male | No
Female | Yes
Male | Yes
Contingency Table:
Gender | Yes | No | Total
Male | 2 | 1 | 3
Female | 1 | 1 | 2
Total | 3 | 2 | 5
Expected Frequencies:
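Under the no-association assumption, each expected count is (row total × column total) / grand total. The sketch below applies this to the 5-sample Gender/Purchase contingency table and then sums (O − E)² / E over the four cells; the variable names are illustrative only.

```python
# Observed counts from the 5-sample Gender / Purchase contingency table
observed = {
    "Male": {"Yes": 2, "No": 1},
    "Female": {"Yes": 1, "No": 1},
}

# Row totals, column totals, and grand total
row_totals = {g: sum(counts.values()) for g, counts in observed.items()}
col_totals = {}
for counts in observed.values():
    for label, n in counts.items():
        col_totals[label] = col_totals.get(label, 0) + n
grand_total = sum(row_totals.values())

# Expected count per cell if Gender and Purchase were independent
expected = {
    g: {label: row_totals[g] * col_totals[label] / grand_total
        for label in col_totals}
    for g in observed
}

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = sum(
    (observed[g][label] - expected[g][label]) ** 2 / expected[g][label]
    for g in observed
    for label in col_totals
)

print(expected)
print(round(chi2, 4))
```

This gives expected counts of 1.8 and 1.2 for Male and 1.2 and 0.8 for Female, and χ² ≈ 0.139. With 1 degree of freedom the 5% critical value is 3.841, so this tiny sample shows no evidence of association; with expected counts this small the test is in any case only illustrative.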
Recursive Feature Elimination (RFE) is a feature selection technique used in machine learning to identify the most
important features (or variables) for building a predictive model. It works by recursively removing the least important
features and building the model repeatedly until the desired number of features is reached.
Here, “recursively removing” means removing one (or more) features at each step and repeating this until only the most important features remain.
Steps in RFE:
1. Train the model on the current set of features.
2. Rank the features by importance (e.g., coefficients in linear models or feature importance in tree models).
3. Remove the least important feature(s).
4. Repeat the process until the desired number of features is reached.
Recursive Feature Elimination (RFE): Mathematical Steps
Let’s take an example dataset with 5 features:
The log-likelihood function is maximized to estimate the coefficients
Calculate new coefficients → rank again → remove the next least important.
Repeat until the desired number of features (e.g., 2 or 3) remains.
Note on Other Models:
If we use Random Forest instead of Logistic Regression:
Feature importance = decrease in Gini impurity or entropy.
Summary of steps:
Train model → estimate coefficients β
Rank features using ∣β∣ (or Gini importance, etc.)
Eliminate lowest one
Repeat with reduced feature set
Example: (answer accuracy not needed)
Sample X1 (Study Hours) X2 (Attendance %) X3 (Sleep Hours) X4 (Social Media Hours) y (Pass=1)
1 4 90 7 2 1
2 1 70 6 5 0
3 3 80 6.5 4 1
4 2 60 5 6 0
5 5 95 8 1 1
Step 1: Logistic Regression Model
Logistic regression model equation:
P(y=1 | X) = 1 / (1 + e^(-z)), where z = β₀ + β₁·x₁ + β₂·x₂ + β₃·x₃ + β₄·x₄
Assumed trained coefficients: β₀ = -12, β₁ = 1.5, β₂ = 0.1, β₃ = 0.2, β₄ = -0.05
Step 2: Predict for Sample 1
Sample 1: x₁=4, x₂=90, x₃=7, x₄=2
Substitute into z:
z = -12 + 1.5·4 + 0.1·90 + 0.2·7 - 0.05·2
z = -12 + 6 + 9 + 1.4 - 0.1 = 4.3
P(y=1|X) = 1 / (1 + e^(-4.3)) ≈ 0.987
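The arithmetic for Sample 1 can be checked with a few lines of Python, using the assumed coefficients from Step 1 (they are illustrative values from the example, not fitted ones).

```python
import math

# Assumed coefficients from the example: intercept, then X1..X4
beta0 = -12.0
betas = [1.5, 0.1, 0.2, -0.05]

# Sample 1: study hours, attendance %, sleep hours, social media hours
sample1 = [4, 90, 7, 2]

# Linear score z = beta0 + sum(beta_j * x_j)
z = beta0 + sum(b * x for b, x in zip(betas, sample1))

# Logistic (sigmoid) function turns the score into a probability
p = 1.0 / (1.0 + math.exp(-z))

print(round(z, 2), round(p, 3))
```

The result agrees with the hand calculation: z = 4.3 and P(y=1|X) ≈ 0.987.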
Step 3: Feature Importance (First Iteration)
Feature importance (absolute coefficients):
Feature Coefficient β |β|
X1 1.5 1.5
X2 0.1 0.1
X3 0.2 0.2
X4 -0.05 0.05
Eliminate X4 (least important |β| = 0.05)
Step 4: Retrain Without X4
New coefficients: β₀ = -10, β₁ = 1.8, β₂ = 0.05, β₃ = 0.1
Importance:
Feature Coefficient β |β|
X1 1.8 1.8
X2 0.05 0.05
X3 0.1 0.1
X2 now has the smallest |β| (0.05), so it is the next feature to be eliminated.
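The train, rank, eliminate, repeat loop can be sketched end to end as follows. This is an illustration under stated assumptions: the logistic regression is a small batch gradient-descent fit written for this example (not a library routine), the learning rate and epoch count are arbitrary, and features are standardized first so that |β| values are comparable across different units, which the hand example above does not do. Which features survive depends on these choices.

```python
import math

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in X) / n)
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in X]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the logistic log-loss; returns the weights."""
    n, d = len(X), len(X[0])
    b0, b = 0.0, [0.0] * d
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * d
        for row, target in zip(X, y):
            z = b0 + sum(bj * xj for bj, xj in zip(b, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            g0 += err
            for j in range(d):
                g[j] += err * row[j]
        b0 -= lr * g0 / n
        b = [bj - lr * gj / n for bj, gj in zip(b, g)]
    return b

def rfe(X, y, names, n_keep):
    """Repeatedly refit and drop the feature with the smallest |beta|."""
    cols = list(range(len(names)))
    Xs = standardize(X)
    while len(cols) > n_keep:
        subset = [[row[c] for c in cols] for row in Xs]
        coefs = fit_logistic(subset, y)
        weakest = min(range(len(cols)), key=lambda i: abs(coefs[i]))
        cols.pop(weakest)
    return [names[c] for c in cols]

# The 5-sample student dataset from the example table
X = [[4, 90, 7, 2], [1, 70, 6, 5], [3, 80, 6.5, 4], [2, 60, 5, 6], [5, 95, 8, 1]]
y = [1, 0, 1, 0, 1]
names = ["X1", "X2", "X3", "X4"]

selected = rfe(X, y, names, n_keep=2)
print(selected)
```

The model is retrained after every elimination because the remaining coefficients change once a feature is removed, which is what distinguishes RFE from ranking the features a single time.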
Advantages of RFE
1. Model-Based Selection: RFE selects features based on how important they are to a specific model (e.g., SVM, Logistic Regression, or Random Forest), which often leads to better accuracy.
2. Effective for High-Dimensional Data: Works well in domains such as bioinformatics and text classification, where there are more features than samples.
3. Recursive and Systematic: It removes the least important features in a step-by-step, greedy fashion, leading to a well-tuned subset.
4. Integrates with Any Estimator: Can be used with any algorithm that provides feature importance (like coefficients or Gini).
6. Gives Feature Ranking: Besides selection, RFE also provides a full ranking of all features based on importance.
Limitations of RFE
1. Computationally Expensive: Training the model repeatedly for different subsets of features can be very slow, especially with large datasets.
2. Greedy Approach May Miss Optimal Subset: It eliminates features one-by-one without backtracking, which can sometimes lead to suboptimal selection.
3. Model-Dependent: Results vary depending on the estimator (e.g., Logistic Regression may rank features differently than Random Forest).
4. Sensitive to Correlated Features: RFE may keep one correlated feature and remove another, even though both together might improve performance.
5. Requires Model That Supports Feature Importance: Cannot be used with models that don’t expose feature ranking (e.g., k-NN, naïve Bayes without wrappers).
5 Define Variance Thresholding and explain how it is used for robust feature selection. 6
If:
Var (Xj) < Threshold
Then feature Xj is removed.
Low-variance features:
Do not help models distinguish between outputs.
Add noise and increase model complexity.
May cause overfitting or slow down training.
Thus, removing them makes feature selection robust and faster.
Illustrative Example (answer accuracy is not required)
Small Dataset (5 samples, 4 features):
Sample X1 X2 X3 X4
1 0 1 5 0
2 0 1 7 0
3 0 1 6 0
4 0 1 8 0
5 0 1 9 0
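Applying the rule Var(Xj) < Threshold to this table can be sketched as follows. The threshold of 0.1 is a hypothetical value chosen for illustration, and the population variance (divide by n) is assumed.

```python
from statistics import pvariance

# The 5-sample, 4-feature dataset from the table
samples = [
    [0, 1, 5, 0],
    [0, 1, 7, 0],
    [0, 1, 6, 0],
    [0, 1, 8, 0],
    [0, 1, 9, 0],
]
names = ["X1", "X2", "X3", "X4"]
threshold = 0.1  # hypothetical cut-off

# Variance of each column
variances = {
    name: pvariance([row[j] for row in samples])
    for j, name in enumerate(names)
}

# Remove features whose variance is below the threshold
kept = [name for name in names if variances[name] >= threshold]

print(variances)
print(kept)
```

X1, X2, and X4 are constant (variance 0) and are removed; X3 has variance 2 and is the only feature kept.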
Advantages
1. Simple to Understand and Implement: Easy to compute variance and apply a threshold. No complex algorithms involved.
2. Unsupervised: Does not require the target variable (label). Useful for early data preprocessing.
3. Fast and Scalable: Suitable for high-dimensional data. Just a single pass over each feature.
4. Removes Uninformative Features: Eliminates constant or near-constant features that do not help in modeling.
6. Improves Speed: Speeds up training and prediction by reducing the number of input features.
Limitations
1. Ignores Target Variable: It does not check whether a feature is actually related to the output class or regression target.
4. Does Not Handle Categorical Data Directly: It works on numerical data only; categorical features must be encoded first.
5. Does Not Detect Redundancy: It does not remove correlated features (e.g., duplicated information).
Definition
An outlier (or anomaly) is a data point that significantly differs from other observations. In machine learning and
statistics, outliers may indicate rare events, data entry errors, or fraud.
LOF
Local Outlier Factor (LOF) compares the local density of a point to the densities of its neighbors.
If the point is in a less dense region than its neighbors, it is considered an outlier.
Idea:
A point is an outlier if it has a lower density than its neighbors.
LOF substantially greater than 1 → point is a potential anomaly
LOF ≈ 1 → point is normal
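The density comparison behind LOF can be sketched in plain Python. This is a simplified version under stated assumptions: Euclidean distance, exactly k neighbours per point (ties at the k-th distance are ignored), and no duplicate points. The data and k = 2 are hypothetical.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point (simplified, see assumptions above)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbours of each point and the distance to the k-th one
    neighbours, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k-distance of j, distance between i and j)
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average neighbour density divided by the point's own density
    return [sum(lrd[j] for j in neighbours[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: a tight cluster plus one isolated point
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5), (8, 8)]
scores = lof_scores(points, k=2)

for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The clustered points score close to 1, while the isolated point (8, 8) scores far above 1 because its neighbours sit in a much denser region than it does.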
Examples:
Bank transaction of ₹10,00,000 from a student account
Sensor reading = 200°C when normal range is 20–80°C
Purpose of Anomaly Detection
- Fraud detection
- Fault detection in machines
- Network intrusion detection
- Medical diagnostics
- Data cleaning
Categories of Anomalies
1. Point Anomalies: Single unusual instance
2. Contextual Anomalies: Normal in one context, abnormal in another
3. Collective Anomalies: A group of points that are abnormal together
Methods to Detect Anomalies
1. Statistical Methods
a) Z-Score Method: z = (x - μ) / σ.
Mark x as anomaly if |z| > 3
b) IQR Method: IQR (Interquartile Range) = Q3 - Q1.
Outliers if x < Q1 - 1.5×IQR or x > Q3 + 1.5×IQR
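Both statistical rules can be sketched with the standard library. The readings below are hypothetical sensor values, and `statistics.quantiles` with its default settings is one of several quartile conventions, so Q1 and Q3 may differ slightly from other tools.

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, limit=3.0):
    """Values whose |z| = |x - mean| / std exceeds the limit."""
    mu, sigma = mean(values), pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > limit]

def iqr_outliers(values):
    """Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical temperature readings in °C with one extreme value
readings = [22, 25, 31, 38, 40, 45, 47, 52, 60, 200]

print(zscore_outliers(readings))
print(iqr_outliers(readings))
```

On this small sample the IQR rule flags 200, while the |z| > 3 rule flags nothing: the extreme value inflates the mean and standard deviation it is measured against (its z-score is about 2.92). This is one reason the IQR rule is often preferred for small or skewed samples.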
2. Distance-Based Methods:
Distance-based methods identify an observation as an anomaly if it lies far away from most other points in the dataset.
Idea:
Anomalies are the points that are not close to any other point.
Intuition
Normal data points tend to form dense clusters.
Outliers appear as isolated points with large distances from others.
Basic Principle
Let x be a data point.
dist(x, y) is the distance between points x and y.
D(x) is the average distance from x to its k nearest neighbors.
Then:
If D(x) > threshold, x is marked as an anomaly.
a) k-Nearest Neighbors: Points far from k neighbors are anomalies
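The rule D(x) > threshold can be sketched as follows, with D(x) taken as the average Euclidean distance to the k nearest neighbours. The points, k = 2, and the threshold of 3.0 are hypothetical.

```python
import math

def avg_knn_distance(points, k):
    """D(x) for every point: mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: four clustered points and one far away
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9)]
threshold = 3.0  # hypothetical cut-off

scores = avg_knn_distance(points, k=2)
anomalies = [p for p, s in zip(points, scores) if s > threshold]

print(anomalies)
```

Only (9, 9) exceeds the threshold; each clustered point has an average neighbour distance of 1. In practice the threshold is usually chosen from the distribution of D(x) values rather than fixed in advance.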
Advantages of Using IQR for Outlier Detection
Benefit Description
Summary
Method | Type | Key Idea | Use Case
Z-Score/IQR | Statistical | Deviation from mean/median | Simple numerical data
k-NN | Distance-Based | Far from neighbors | General anomalies
LOF | Density-Based | Sparse density | Subtle anomalies
Isolation Forest | Model-Based | Easy to isolate | High-dimensional data
One-Class SVM | Model-Based | Boundary for normal data | Fraud/Rare events
Autoencoders | Deep Learning | High reconstruction error | Complex/sequential data
7 Explain in depth under-sampling, over-sampling, random sampling and random resampling. 12
3. Random Sampling
Randomly selecting a subset of the dataset, irrespective of class labels.
Types:
- With replacement: same sample can appear more than once
- Without replacement: samples are unique
Let N be the total number of samples.
Let k be the desired number of samples (k < N).
Randomly pick k samples.
New dataset size = k
Example:
From 10,000 samples, randomly select 2000.
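Both variants can be sketched with Python's `random` module. The dataset here is just a hypothetical list of 10,000 sample IDs, and the seed is fixed only to make the run repeatable.

```python
import random

rng = random.Random(42)  # fixed seed for repeatability

dataset = list(range(10_000))  # hypothetical sample IDs
k = 2000

# Without replacement: every selected sample is unique
without_replacement = rng.sample(dataset, k)

# With replacement: the same sample can be drawn more than once
with_replacement = rng.choices(dataset, k=k)

print(len(set(without_replacement)), len(set(with_replacement)))
```

Sampling without replacement always yields 2000 distinct samples; sampling with replacement yields 2000 draws that may contain repeats.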
4. Random Resampling
Randomly resample the dataset using either over-sampling or under-sampling to achieve class balance.
Let M and m be the sizes of the majority and minority classes, with M > m.
Then either:
- Over-sampling: Add (M - m) samples to the minority class
- Under-sampling: Remove (M - m) samples from the majority class
Example:
Original: Class 0 = 950, Class 1 = 50
Resampled: Class 0 = 950, Class 1 = 950 (over-sampling) or both = 50 (under-sampling)
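The two balancing options can be sketched as follows, using the 950 / 50 split from the example. The records are hypothetical placeholders; over-sampling here duplicates minority samples at random rather than synthesizing new ones.

```python
import random

def random_oversample(majority, minority, rng):
    """Add (M - m) randomly duplicated minority samples."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra

def random_undersample(majority, minority, rng):
    """Keep only m randomly chosen majority samples."""
    return rng.sample(majority, len(minority)), minority

rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical records: (class label, sample id)
class0 = [(0, i) for i in range(950)]
class1 = [(1, i) for i in range(50)]

over0, over1 = random_oversample(class0, class1, rng)
under0, under1 = random_undersample(class0, class1, rng)

print(len(over0), len(over1))
print(len(under0), len(under1))
```

Over-sampling ends with 950 samples per class and under-sampling with 50 per class. Resampling is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.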
Summary Table
Technique | Working | Pros | Cons
Under-Sampling | Reduces majority class size | Reduces training time | Loss of information
Over-Sampling | Increases minority class size | No data loss | Risk of overfitting
Random Sampling | Selects random subset | Simple and fast | May preserve imbalance
Random Resampling | Balancing by over/under sampling | Improves model balance | Overfitting or data loss
8 What is an imbalanced dataset? What are the different resampling techniques? Explain any one method in depth. 7
2. Over-Sampling
Increases the number of samples in the minority class to match the majority class, either by duplication or synthetic
generation.
Let M be the number of majority samples, and m be the number of minority samples. If M > m, generate (M - m) new
samples for the minority class.
New dataset size = 2 × M
Example:
Original: Class 0 = 1000, Class 1 = 100
After over-sampling: Class 0 = 1000, Class 1 = 1000
SMOTE (Synthetic Minority Over-sampling Technique) – explained in depth
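SMOTE generates synthetic examples of the minority class by interpolating between an existing minority sample xi and one of its nearest minority-class neighbours xzi:
x_new = xi + δ·(xzi − xi), where δ is a random number in [0, 1].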
Example:
Assume:
xi = [2, 3], xzi = [4, 5], δ = 0.5
Then:
x_new = [2 + 0.5*(4-2), 3 + 0.5*(5-3)] = [3, 4]
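The interpolation step can be written as a one-line function. This sketch covers only the generation of a single synthetic point; choosing the neighbour (normally one of the k nearest minority samples) and drawing δ at random are left out, and the function name is illustrative.

```python
def smote_point(x_i, x_neighbour, delta):
    """Synthetic sample on the line segment between x_i and its neighbour."""
    return [a + delta * (b - a) for a, b in zip(x_i, x_neighbour)]

# Values from the example: xi = [2, 3], xzi = [4, 5], delta = 0.5
x_new = smote_point([2, 3], [4, 5], 0.5)
print(x_new)  # [3.0, 4.0]
```

With δ = 0 the new point coincides with xi and with δ = 1 it coincides with the neighbour, so every synthetic sample lies on the segment joining two real minority samples.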
Advantages:
- Reduces the overfitting risk that comes from exact duplicates of minority samples
- Expands minority class feature space
Disadvantages:
- Risk of overlapping classes
- May amplify noise
Comparison Table
Technique | Description | Pros | Cons
Under-Sampling | Reduces majority class size | Fast, low memory | Loss of information
Over-Sampling | Duplicates or generates minority samples | No data loss | Risk of overfitting
SMOTE | Generates synthetic samples | Better generalization | Complex, may add noise
9 Explain a Test on Numerical Data- Distribution of a Sample Mean with examples. (Central Limit Theorem) – CLT 6
Understanding the distribution of a sample mean is essential in inferential statistics. This concept forms the basis of
hypothesis testing and confidence intervals when analyzing numerical data.
Definition:
The distribution of a sample mean refers to the probability distribution of means calculated from multiple random
samples of the same size (n) drawn from a population.
Let
X₁, X₂, ..., Xₙ
be independent and identically distributed (i.i.d.) random variables from a population with mean μ and variance σ².
Then the sample mean is:
𝑋̄ = (X₁ + X₂ + ... + Xₙ) / n
Properties:
1. Expected Value:
E(𝑋̄) = μ
(The expected value of the sample mean equals the population mean)
2. Variance:
Var(𝑋̄) = σ² / n
(The variance of the sample mean decreases with the sample size)
Central Limit Theorem (CLT):
For a population with finite variance, regardless of the shape of its distribution, as n becomes large (a common rule of thumb is n ≥ 30), the sampling distribution of the sample mean approaches a normal distribution:
𝑋̄ ~ N(μ, σ²/n) (approximately)
Example:
Suppose the population of weights of apples is normally distributed with:
Mean μ = 150 grams
Standard deviation σ = 30 grams
If we randomly select a sample of n = 25 apples, then the sample mean weight 𝑋̄ is distributed as:
𝑋̄ ~ N(150, (30²)/25) = N(150, 36)
So, the standard deviation of the sample mean (also called the standard error) is:
SE = σ / √n = 30 / √25 = 6 grams
Now suppose we want to find the probability that the sample mean weight is less than 145 grams:
P(𝑋̄ < 145) = P(Z < (145 − 150)/6) = P(Z < −0.8333)
≈ 0.2023 (from standard normal table)
Thus, there is approximately a 20.23% chance that the sample mean weight of 25 apples is less than 145 grams.
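The apple example can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8+), which evaluates the normal CDF directly instead of using a table.

```python
import math
from statistics import NormalDist

mu, sigma, n = 150, 30, 25

# Standard error of the sample mean: sigma / sqrt(n)
se = sigma / math.sqrt(n)

# Sampling distribution of the mean: N(mu, se^2)
sample_mean_dist = NormalDist(mu, se)

# P(sample mean < 145)
p = sample_mean_dist.cdf(145)

print(se, round(p, 4))
```

This reproduces the standard error of 6 grams and a probability of about 0.2023. Because the population itself is stated to be normal, the result holds for n = 25 without relying on the large-sample approximation.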
Summary Table: