Data Preprocessing Techniques in Mining

DATA MINING

Lecture 4: Data Preprocessing III

Dr. Doaa Elzanfaly

Lecture Outline

◼ Important Concepts
◼ Overfitting

◼ Orthogonal Features

◼ Feature Engineering
◼ Feature Encoding

◼ Feature Creation

◼ Feature Extraction

◼ Feature Selection

Feature Engineering
Why Dimensionality Reduction
◼ Curse of Dimensionality
◼ When dimensionality increases, data
becomes increasingly sparse
◼ Density and distance between points,
which is critical to clustering, outlier
analysis, becomes less meaningful.
◼ The possible combinations of subspaces
will grow exponentially
◼ Dimensionality Reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data
mining
◼ Allow easier visualization
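
The claim that distances become less meaningful in high dimensions can be checked numerically. The sketch below is an illustration only: it assumes points drawn uniformly from the unit hypercube, and the helper name `distance_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative gap between the farthest and nearest point from one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube (assumption)
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(500, d), 3))
```

When the contrast is small, "nearest" and "farthest" neighbours are almost equally far away, which is why clustering and outlier analysis suffer.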
Feature Extraction
(Reducing Dimensionality)
▪ Feature extraction involves transforming raw data into a new set of
features, often reducing dimensionality while preserving essential
information.

▪ Techniques:

√ Principal Component Analysis (PCA) - Projects data into fewer dimensions

√ Independent Component Analysis (ICA) - Maximizes the statistical independence of the components

√ Linear Discriminant Analysis (LDA) - Maximizes class separability

• Fourier Transform - Converts time-series data into frequency components

• Autoencoders (Deep Learning) - Learn compressed representations of high-dimensional data
1. Dimensionality Reduction
Principal Component Analysis (PCA)
◼ Principal component analysis (PCA) is a technique that transforms high-dimensional
data into a lower-dimensional representation while retaining as much information
as possible.
◼ It finds a projection that captures the largest amount of variation in the data.
◼ The original data are projected onto a much smaller space, resulting in
dimensionality reduction.

[Figure: the original 3-dimensional data set, and the scatterplot after PCA reduced it from 3 dimensions to 2 dimensions.]
Variance is information

Person        Alex   Ben   Chris
Height (cm)   185    145   160

Can you guess who’s who based solely on their height?

[Figure: the three persons, labelled Alex, Chris, and Ben.]
Try this

Person        Danial   Mark   Peter
Height (cm)   171      173    170

Can you guess who’s who?

Another Round

Person        Alex   Ben   Chris
Height (cm)   185    145   160
Weight (kg)   69     68    67

If we doubled the number of features, would that change your guessing strategy?
The weight differences are so small (i.e., weight has a small variance) that they don’t help in
telling the persons apart at all. You would still have to rely mostly on height to make your guesses.
Intuitively, we have reduced the data from 2-D to 1-D. The idea is that we
can selectively keep the variables with higher variance and forget
about the variables with lower variance.
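
The intuition can be made concrete by computing the two variances from the table. This is a minimal sketch using Python's standard `statistics` module and the sample variance; the variable names are illustrative.

```python
from statistics import variance

heights_cm = [185, 145, 160]   # Alex, Ben, Chris
weights_kg = [69, 68, 67]

height_var = variance(heights_cm)   # sample variance (divides by n - 1)
weight_var = variance(weights_kg)

print(f"height variance: {height_var:.1f}")
print(f"weight variance: {weight_var:.1f}")
```

Height varies hundreds of times more than weight here, so it carries nearly all of the information that distinguishes the three persons. Note that the comparison mixes centimetres and kilograms, so it only works as an illustration; in practice the features would be put on a common scale first.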
Principal Component Analysis (PCA)
◼ What if the two features (height and weight) have the same variance? Does it mean we can
no longer reduce the dimensionality of this data set?

◼ Instead of limiting ourselves to choosing just one or the other, why not combine them?

◼ PCA combines both height and weight to create two brand new variables. Each could be 30%
height and 70% weight, or 87.2% height and 12.8% weight, or any other combination,
depending on the data that we have.

◼ These two new variables are called the first principal component (PC1) and the second
principal component (PC2).

[Figure: a dataset with 2 features and 4 observations, and the transformed data using PC1 only.]
Principal Component Analysis (PCA)
◼ Principal components are new
variables that are constructed as
linear combinations or mixtures of
the initial variables.
◼ The new variables are uncorrelated
and most of the information within
the initial variables is squeezed or
compressed into the first
components.
◼ For example, 10-dimensional data
gives 10 principal components, with
the maximum possible information in
the first component, then the maximum
remaining information in the second,
and so on.
◼ The dimensionality can then be
reduced without losing much
information, by discarding the
components with low information
and considering the remaining
components.
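
The construction described above can be sketched with an eigendecomposition of the covariance matrix. This is one common way to compute PCA, not necessarily the one used to produce the lecture's figures; the function name `pca` and the height/weight numbers are made up for illustration, and the data are left unscaled to keep the example short.

```python
import numpy as np

def pca(X: np.ndarray):
    """Return (scores, components, explained_variance) for the rows of X."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # covariance between features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                      # data expressed in the new variables
    return scores, eigvecs, eigvals

# Hypothetical height (cm) and weight (kg) values, invented for this example.
X = np.array([[185, 82], [145, 50], [160, 61], [172, 70], [168, 66]], dtype=float)
scores, components, explained = pca(X)

print("share of variance per component:", explained / explained.sum())
pc1_only = scores[:, :1]   # keep PC1 and discard PC2: 2-D reduced to 1-D
```

Each column of `components` holds the weights of one linear combination of the original variables, and the columns of `scores` are uncorrelated with each other.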
Principal Component Analysis (PCA)

▪ The PCA is an effective dimension reduction technique
when predictors are linearly correlated and when the
resulting scores are associated with the response.

▪ However, the orthogonal partitioning of the predictor space may not provide a good predictive
relationship with the response, especially if the true underlying relationship
between the predictors and the response is non-linear.

• There are many examples where variables are uncorrelated (i.e., their correlation coefficient
is close to zero) but still have a clear relationship.

• These examples often involve non-linear relationships that can’t be captured by the
correlation coefficient.

▪ PCA requires data to be normalized, as it is sensitive to the scale of the variables.

Independent variables are always uncorrelated, but uncorrelated variables are not necessarily independent.
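
A small numerical example of uncorrelated but dependent variables: below, `y` is completely determined by `x`, yet their correlation coefficient is essentially zero. The choice of a symmetric grid and a quadratic relationship is an assumption made for the illustration.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # values symmetric around zero
y = x ** 2                        # y depends entirely on x, but not linearly

r = np.corrcoef(x, y)[0, 1]       # Pearson correlation coefficient
print(f"correlation between x and y: {r:.6f}")
```

The correlation coefficient only measures linear association, so it misses the parabola completely.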
Independent Component Analysis (ICA)

▪ ICA creates new components that are linear combinations of the original
variables, similar to PCA.
▪ ICA aims to maximize the statistical independence of the components.
▪ This allows ICA to model a broader set of trends than PCA, which focuses on orthogonality
and linear (correlation-based) relationships.
▪ ICA components are not ordered uniquely (unlike PCA, where components
are ordered by explained variance).
▪ ICA components will differ from PCA components unless the data
exhibits strictly linear trends.
▪ Data is typically normalized and whitened before applying ICA.
• Whitening involves transforming the data into the full set of PCA components, each rescaled to
unit variance, which simplifies computations without affecting ICA's goals.
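
The whitening step can be sketched on its own. The code below covers only this preprocessing step, not ICA itself, and assumes the usual definition of whitening: rotate the data onto all PCA components, then rescale each to unit variance. The function name `whiten` and the mixed data are made up for illustration.

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """Rotate onto all PCA components, then rescale each to unit variance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvecs) / np.sqrt(eigvals)

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: a linear mix of two uniform sources.
sources = rng.uniform(-1, 1, size=(1000, 2))
mixing = np.array([[1.0, 0.5], [0.3, 2.0]])
X = sources @ mixing.T

Z = whiten(X)
print(np.round(np.cov(Z, rowvar=False), 6))   # close to the identity matrix
```

After whitening, the variables are uncorrelated with unit variance, so ICA only has to search for the rotation that makes them as independent as possible.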
Linear Discriminant Analysis (LDA)

▪ Projects data onto axes that maximize the separation between classes (for
supervised learning).

▪ It finds directions that maximize the between-class variance while minimizing
the within-class variance.

▪ The output is a set of discriminant directions (not necessarily orthogonal).

▪ It assumes that:

▪ Data is normally distributed within each class.

▪ Classes have identical covariance matrices.

[Figure: Linear Discriminant Analysis – Bit by Bit]

▪ It cannot work with an unlabelled dataset.

▪ LDA is sensitive to violations of its assumptions (e.g., non-normal data).
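
For two classes, the direction that maximizes between-class variance relative to within-class variance has a closed form (Fisher's discriminant). The sketch below covers only this two-class case; the function name `fisher_direction` and the synthetic Gaussian data are assumptions made for the example.

```python
import numpy as np

def fisher_direction(X0: np.ndarray, X1: np.ndarray) -> np.ndarray:
    """Unit vector maximizing between-class over within-class scatter (two classes)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: spread of each class around its own mean, summed.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])   # shared covariance, matching the LDA assumption
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w = fisher_direction(X0, X1)
z0, z1 = X0 @ w, X1 @ w                     # 1-D projections of each class
print("discriminant direction:", w)
print("projected class means:", z0.mean(), z1.mean())
```

Unlike PCA, the computation needs the class labels (here, the split into `X0` and `X1`), which is why LDA cannot be applied to unlabelled data.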

PCA vs. LDA

Feature Extraction

◼ Another way to reduce the dimensionality of data is to remove:

◼ Redundant attributes
◼ Duplicate much or all of the information contained in
one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax
paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data
mining task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Feature Engineering
Feature Selection

▪ Some transformations or dimension reduction methods lead to a new predictor set that
has as many or fewer predictors than the original data.

▪ Other transformations, like Basis Expansions, generate more features than the original
data.

▪ The hope is that some of the newly engineered predictors capture a predictive relationship
with the outcome.

▪ For several models (e.g., SVMs, neural networks), predictive performance is degraded
as the number of uninformative predictors increases.

▪ Therefore, there is a genuine need to appropriately select predictors for modelling.

▪ Techniques:

◼ Intrinsic methods (e.g., Lasso, Decision Trees) select features based on their importance in model fitting.

◼ Filter methods (e.g., correlation, mutual information) rank features based on statistical relevance.

◼ Wrapper methods (e.g., RFE, forward selection) iteratively test feature subsets for predictive power.
Feature Selection

Feature Selection Methodologies

◼ Intrinsic (or implicit) methods (e.g., Lasso, Decision Trees): have feature selection naturally
incorporated with the modelling process.
◼ Filter methods (e.g., correlation, mutual information) and Wrapper methods (e.g., RFE, forward
selection): work to couple feature selection approaches with modelling techniques.

Intrinsic Methods
The most seamless and important of the three classes for reducing features.

Tree- and Rule-Based Models
▪ These models split data based on predictors that create more homogeneous partitions.
▪ If a predictor is not used in any split, it is effectively excluded.
▪ Ensembles like random forests may force splits on irrelevant predictors,
leading to over-selection.

Multivariate Adaptive Regression Splines (MARS)
▪ Instead of fitting a single global equation, MARS divides the data into different
regions and fits separate regression equations within each region. This is done
using basis functions (BFs), which help capture non-linearity.
▪ If a predictor is not included in at least one of these regression equations, it is excluded
from the model.

Regularization Models
▪ These models use penalties to shrink predictor coefficients.
▪ The lasso method forces some coefficients to exactly zero, effectively
removing irrelevant predictors.
Lasso Regression (L1 Regularization)
Lasso: Least Absolute Shrinkage and Selection Operator

▪ Lasso Regression is a type of linear regression that incorporates L1 regularization, which helps
in both feature selection and shrinking model coefficients.
▪ Lasso modifies the ordinary least squares (OLS) objective function by adding an L1 penalty
term, weighted by a tuning parameter λ ≥ 0.
▪ The L1 penalty forces some regression coefficients to be exactly zero, effectively
performing automatic feature selection.
▪ Large λ: More coefficients shrink to zero → simpler model with fewer features
▪ Small λ: Less regularization → more features retained
▪ λ = 0: Lasso reduces to ordinary least squares (OLS) regression
▪ Reduces overfitting: Helps improve generalization by preventing overly complex models.
▪ Works well with high-dimensional data: Useful when there are many irrelevant or correlated
predictors.
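
The objective function itself is not written out in the text above. In its standard textbook form, with notation assumed here (n observations, p predictors, intercept \beta_0, coefficients \beta_j), the Lasso estimate is:

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert ,
  \qquad \lambda \ge 0
```

The first term is the usual OLS sum of squared residuals and the second is the L1 penalty. Setting λ = 0 removes the penalty, which is why Lasso then reduces to OLS.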
Coefficients Shrink with large Lambda λ

▪ Small λ: All features retain nonzero coefficients.

▪ Increasing λ: Coefficients shrink towards zero.
▪ Large λ: Some coefficients become exactly zero, effectively removing those features
from the model.
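
The shrinking behaviour can be reproduced with a small coordinate-descent solver. This is a sketch under stated assumptions, not a reference implementation: the function names are made up, the data are synthetic, the columns of `X` are standardized and `y` is centred (so no intercept is fitted), and the squared-error term is scaled by 1/(2n), so the λ values are not directly comparable to other scalings.

```python
import numpy as np

def soft_threshold(rho: float, lam: float) -> float:
    """Shrink rho towards zero by lam; anything within [-lam, lam] becomes 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            residual = y - X @ b + X[:, j] * b[j]   # residual with feature j left out
            rho = X[:, j] @ residual / n
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize the predictors
# Only features 0 and 1 matter; features 2 to 4 are irrelevant by construction.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
y = y - y.mean()

for lam in (0.0, 0.1, 1.0, 5.0):
    print(lam, np.round(lasso_cd(X, y, lam), 3))
```

The soft-thresholding step is what produces exact zeros: once a feature's contribution falls below λ, its coefficient is set to zero instead of merely becoming small.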
Filter Methods

▪ Filter methods conduct an initial supervised analysis of the predictors to determine
which are important and then only provide these to the model. Some common
examples include:
◼ Correlation-based Feature Selection (CFS) – Selects features with the highest correlation with the
target variable and the lowest correlation with each other.
◼ Chi-Square Test – Tests whether a categorical feature and the target variable are independent;
features that show strong dependence are kept.
◼ Mutual Information – Evaluates how much information a feature contributes to predicting the
target variable.
◼ Variance Thresholding – Removes features with low variance, assuming they provide little
discriminative power.
◼ ANOVA (Analysis of Variance) F-test – Compares the variance between groups with the variance
within groups to determine feature importance (for numerical features).
◼ Information Gain – Calculates the reduction in entropy when using a feature for classification.
◼ Principal Component Analysis (PCA) – Although primarily a dimensionality reduction technique, it
can also be used to reduce the feature set by keeping the most important components.

▪ A selection of predictors that meets a filtering criterion like statistical significance may
not be a set that improves predictive performance.
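
Two of the simplest filters, correlation ranking and variance thresholding, can be sketched in a few lines. The function names, the threshold value, and the synthetic data are assumptions made for this illustration.

```python
import numpy as np

def rank_by_correlation(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Feature indices sorted by |Pearson correlation with y|, strongest first."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1]

def drop_low_variance(X: np.ndarray, threshold: float) -> np.ndarray:
    """Indices of the features whose variance exceeds the threshold."""
    return np.flatnonzero(X.var(axis=0) > threshold)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 2] *= 0.01                                    # feature 2 is almost constant
y = 2.0 * X[:, 1] + 1.0 * X[:, 3] + rng.normal(size=300)

ranking = rank_by_correlation(X, y)
kept = drop_low_variance(X, threshold=0.01)
print("features ranked by |correlation|:", ranking)
print("features passing the variance filter:", kept)
```

Both filters score each feature without fitting the final model, which makes them cheap but also explains why the selected set is not guaranteed to improve predictive performance.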
Wrapper Methods
▪ A Wrapper Method doesn't just look at the features in isolation.
Instead, it "wraps" itself around a specific model and uses the model's
performance as a guide to select the best subset of features.

▪ It tries out different combinations of features, trains a model on each
combination, and keeps the combination that gives the best model
performance.

▪ Common Search Strategies

• Step-wise forward selection

• Step-wise backward elimination
• Combining forward selection and backward elimination
Wrapper Methods - Stepwise forward selection

◼ The procedure starts with an empty set of attributes as the
reduced set.
◼ First: The best single feature is
picked.
◼ Next: At each subsequent
iteration or step, the best of the
remaining original attributes is
added to the set.
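
The procedure can be sketched as a greedy loop around a model. Here the wrapped model is ordinary least squares scored by mean squared error on a held-out validation set; the model, the scoring rule, the stopping rule (a fixed number of features), and all names are assumptions made for the example.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, features):
    """Fit OLS with an intercept on the chosen features; return held-out MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, features]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, features]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, n_keep):
    """Start empty; repeatedly add the feature that most reduces validation error."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    while len(selected) < n_keep:
        best = min(remaining,
                   key=lambda j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only features 2 and 5 influence y in this synthetic data set.
y = 4.0 * X[:, 2] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=300)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

selected = forward_selection(X_tr, y_tr, X_va, y_va, n_keep=2)
print("selected features, in order of selection:", selected)
```

Each step retrains the model once per remaining candidate, which is what makes wrapper methods more expensive than filters.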
Wrapper Methods - Stepwise backward elimination

◼ The procedure starts with the full set of attributes.
◼ At each step, it removes the
worst attribute remaining in
the set.
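
Backward elimination can be sketched in the same style. As an assumption for this example, the "worst" attribute is taken to be the one whose removal leaves the lowest validation error for an ordinary least squares model; the names and synthetic data are illustrative.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, features):
    """Fit OLS with an intercept on the chosen features; return held-out MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, features]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, features]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def backward_elimination(X_tr, y_tr, X_va, y_va, n_keep):
    """Start with all features; repeatedly drop the one that is least missed."""
    selected = list(range(X_tr.shape[1]))
    while len(selected) > n_keep:
        worst = min(selected,
                    key=lambda j: validation_mse(X_tr, y_tr, X_va, y_va,
                                                 [k for k in selected if k != j]))
        selected.remove(worst)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only features 2 and 5 influence y in this synthetic data set.
y = 4.0 * X[:, 2] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=300)
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]

selected = backward_elimination(X_tr, y_tr, X_va, y_va, n_keep=2)
print("features remaining after elimination:", selected)
```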
Combining forward selection and backward
elimination

◼ The stepwise forward selection and backward elimination methods can be combined.
◼ At each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Hybrid Feature Selection Methods

▪ Hybrid methods combine the strengths of Filter and Wrapper
approaches to achieve better feature selection while balancing
efficiency and accuracy. They generally follow a two-step process:

1. Pre-selection using Filters

◼ The dataset is first filtered using statistical techniques (e.g., correlation, mutual
information, Chi-square test) to remove irrelevant or redundant features.

◼ This step significantly reduces the feature space, making the subsequent steps
computationally efficient.

2. Refinement using Wrappers

◼ The filtered features are then evaluated using Wrapper methods.

◼ This ensures that the selected feature subset is optimal for the specific predictive
model.
EXTRA SLIDES
Regression Analysis
◼ The model is the best-fitting line through the data: the line, defined by its slope and
intercept, that achieves the minimum sum of squared residuals.

[Figure: fitted regression line, with the actual salary, the predicted salary, and the residuals marked.]

◼ The method used to determine the best-fitting line is called Ordinary
Least Squares (OLS). There are other methods, but this is the simplest
one.
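
For a single predictor, OLS has a closed-form solution for the slope and intercept. The sketch below uses made-up numbers; the predictor (years of experience) is an assumption, since the slides only name salary as the response.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form OLS for y = intercept + slope * x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# Hypothetical data: years of experience and salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 35, 41, 44, 48]

intercept, slope = fit_line(years, salary)
predicted = intercept + slope * np.asarray(years)
residuals = np.asarray(salary) - predicted     # actual minus predicted salary

print(f"salary = {intercept:.2f} + {slope:.2f} * years")
print("sum of squared residuals:", np.sum(residuals ** 2))
```

Any other choice of slope or intercept gives a larger sum of squared residuals on these points.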
Regression Analysis
◼ Simple linear regression
◼ Used when we have only one
independent variable
◼ The fitted model is a straight line, since
the model is of 1st order.

◼ Multiple regression
◼ Used when several independent
variables affect the prediction of
the dependent variable
◼ Allows a response variable Y to be
modeled as a linear function of a
multidimensional feature vector.
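
With several independent variables, the same least squares fit can be computed with a linear algebra routine. This is a minimal sketch on synthetic data; the true coefficients below are chosen for the example so that the fitted values can be compared against them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # three hypothetical independent variables
true_intercept, true_slopes = 1.5, np.array([2.0, -1.0, 0.5])
y = true_intercept + X @ true_slopes + rng.normal(scale=0.1, size=100)

A = np.column_stack([np.ones(len(X)), X])     # leading column of ones models the intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least squares solution

print("intercept:", coef[0])
print("slopes:", coef[1:])
```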
