DATA MINING
Lecture 4: Data Preprocessing III
Dr. Doaa Elzanfaly
Lecture Outline
◼ Important Concepts
◼ Overfitting
◼ Orthogonal Features
◼ Feature Engineering
◼ Feature Encoding
◼ Feature Creation
◼ Feature Extraction
◼ Feature Selection
Feature Engineering
Why Dimensionality Reduction
◼ Curse of Dimensionality
◼ When dimensionality increases, data
becomes increasingly sparse
◼ Density and distance between points,
which are critical to clustering and outlier
analysis, become less meaningful.
◼ The possible combinations of subspaces
will grow exponentially
◼ Dimensionality Reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and
reduce noise
◼ Reduce time and space required in data
mining
◼ Allow easier visualization
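A minimal sketch of why distances lose meaning in high dimensions, assuming points drawn uniformly from the unit hypercube (the function name `distance_contrast` and the sample sizes are illustrative choices, not from the lecture). It compares the relative gap between the nearest and farthest neighbour of one query point in 2 and in 1000 dimensions.

```python
import numpy as np

def distance_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min of the distances from one query point to the rest."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast_low = distance_contrast(500, 2)       # low-dimensional case
contrast_high = distance_contrast(500, 1000)   # high-dimensional case
print(contrast_low, contrast_high)
```

In the high-dimensional case the nearest and farthest points end up at almost the same distance, so "near" and "far" stop being informative.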
Feature Extraction
(Reducing Dimensionality)
▪ Feature extraction involves transforming raw data into a new set of
features, often reducing dimensionality while preserving essential
information.
▪ Techniques:
√ Principal Component Analysis (PCA) - Projects data into fewer
dimensions
√ Independent Component Analysis (ICA) - Maximizes the statistical independence of the components
√ Linear Discriminant Analysis (LDA) - Maximizes class separability
• Fourier Transform - Converts time-series data into frequency
components
• Autoencoders (Deep Learning) - Learns compressed representations
of high-dimensional data
1. Dimensionality Reduction
Principal Component Analysis (PCA)
◼ Principal component analysis (PCA) is a technique that transforms
high-dimensional data into a lower-dimensional space while retaining as much
information as possible.
◼ Find a projection that captures the largest amount of variation in data.
◼ The original data are projected onto a much smaller space, resulting in
dimensionality reduction.
[Figure: the original 3-dimensional data set, and the scatterplot after PCA reduced it from 3 dimensions to 2 dimensions]
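A minimal sketch of the 3-D to 2-D projection, assuming synthetic data that lies close to a plane (the data, the `mixing` matrix and the noise level are made up for illustration). PCA is computed here through the singular value decomposition of the centred data, which is one common way to do it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D data that lies close to a 2-D plane.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.7]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                      # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)              # share of variance per component
X2 = Xc @ Vt[:2].T                           # project onto the first two components

print(X2.shape)
print(explained.round(3))
```

Because the third direction only carries noise, the first two components keep almost all of the variance.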
Variance is information
Person Alex Ben Chris
Height (cm) 185 145 160
Can you guess who’s who based solely on their
height?
Alex
Chris
Ben
Try this
Person Danial Mark Peter
Height (cm) 171 173 170
Can you guess who’s who ???
Another Round
Person Alex Ben Chris
Height (cm) 185 145 160
Weight (Kg) 69 68 67
If we doubled the number of features, would
that change your guessing strategy?
The weight differences are so small (a.k.a small variance),
it doesn’t help in differentiating persons at all. You still had
to rely mostly on height to make your guesses.
Intuitively, we reduced the data from 2-D to 1-D. The idea is that we
can selectively keep the variables with higher variances and forget
about the variables with lower variance.
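The comparison above can be checked numerically. This short sketch computes the sample variance of the two columns from the Alex/Ben/Chris table. Note that the two features are in different units, so comparing raw variances is only an intuition aid here.

```python
import numpy as np

height = np.array([185.0, 145.0, 160.0])   # cm, from the table above
weight = np.array([69.0, 68.0, 67.0])      # kg, from the table above

var_height = np.var(height, ddof=1)        # sample variance
var_weight = np.var(weight, ddof=1)
print(var_height, var_weight)
```

The height variance is several hundred times larger than the weight variance, which is why weight adds almost nothing to the guess.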
Principal Component Analysis (PCA)
◼ What if the two features (height and weight)
have the same variance? Does it mean we can
no longer reduce the dimensionality of this data
set?
◼ Instead of limiting ourselves to choosing just one
or the other, why not combine them?
◼ PCA combines both height and weight to create
two brand new variables. It could be 30%
height and 70% weight, or 87.2% height and
12.8% weight, or any other combination
depending on the data that we have.
[Figure: a dataset with 2 features and 4 observations]
[Figure: the transformed data (using PC1 only)]
◼ These two new variables are called the first
principal component (PC1) and the second
principal component (PC2).
Principal Component Analysis (PCA)
◼ Principal components are new
variables that are constructed as
linear combinations or mixtures of
the initial variables.
◼ The new variables are uncorrelated
and most of the information within
the initial variables is squeezed or
compressed into the first
components.
◼ The idea is that 10-dimensional data
gives 10 principal components with
the maximum possible information in
the first component, then maximum
remaining information in the second
and so on.
◼ The dimensionality can then be
reduced without losing much
information, by discarding the
components with low information
and considering the remaining
components.
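A minimal sketch of the properties listed above, assuming synthetic two-feature data (all values are made up). The principal components are obtained from the eigenvectors of the covariance matrix. The resulting scores are uncorrelated and their variances are ordered from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated two-feature data.
x1 = rng.normal(size=300)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=300)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                        # columns are PC1 and PC2
score_cov = np.cov(scores, rowvar=False)

print(eigvecs[:, 0])                         # weights of x1 and x2 in PC1
print(score_cov.round(6))
```

Each column of `eigvecs` holds the weights of the linear combination that defines one component. Keeping only the first column of `scores` reduces the data to one dimension.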
Principal Component Analysis (PCA)
▪ PCA is an effective dimension reduction technique
when predictors are linearly correlated and when the
resulting scores are associated with the response.
▪ However, the orthogonal partitioning of the predictor space
may not provide a good predictive relationship with the
response, especially if the true underlying relationship
between the predictors and the response is non-linear.
• There are many examples where variables are uncorrelated (i.e.,
their correlation coefficient is close to zero) but still have a clear
relationship.
• These examples often involve non-linear relationships that can’t
be captured by the correlation.
▪ PCA requires data to be normalized, as it is sensitive to the
scale of the variables.
Independent variables are always uncorrelated, but uncorrelated variables are not necessarily independent
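A small illustration of the statement above, using an assumed example: y is completely determined by x (y = x²), yet the correlation coefficient is essentially zero because the relationship is non-linear and symmetric.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # values symmetric around zero
y = x**2                          # fully dependent on x, but not linearly
r = np.corrcoef(x, y)[0, 1]
print(r)
```

A method that only looks at correlation, such as PCA, would treat x and y as unrelated here.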
Independent Component Analysis (ICA)
▪ ICA creates new components that are linear combinations of the original
variables, similar to PCA.
▪ ICA aims to maximize the statistical independence of the components.
▪ This allows ICA to model a broader set of trends compared to PCA, which is
limited to linear relationships.
▪ ICA components are not ordered uniquely (unlike PCA, where components
are ordered by explained variance).
▪ ICA components will differ from PCA components unless the data
exhibits strictly linear trends.
▪ Data is typically normalized and whitened before applying ICA.
• Whitening involves transforming the data into the full set of PCA components, which simplifies
computations without affecting ICA's goals.
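A minimal sketch of the whitening step only, assuming two synthetic non-Gaussian sources and a made-up mixing matrix. The data is rotated onto its full set of PCA components and each component is rescaled to unit variance. The ICA rotation that follows whitening is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical independent, non-Gaussian sources, linearly mixed.
sources = rng.uniform(-1.0, 1.0, size=(1000, 2))
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
X = sources @ mixing.T

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = (Xc @ eigvecs) / np.sqrt(eigvals)    # PCA scores scaled to unit variance
white_cov = np.cov(Z, rowvar=False)
print(white_cov.round(6))
```

After whitening, the covariance matrix is the identity, so ICA only has to search for a rotation that makes the components as independent as possible.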
Linear Discriminant Analysis (LDA)
▪ Projects data onto axes that maximize the separation between classes (for
supervised learning).
▪ It finds directions that maximize the between-class variance while minimizing
the within-class variance.
▪ The output is a set of discriminant directions (not necessarily orthogonal).
▪ It assumes that:
▪ Data is normally distributed within each class.
▪ Classes have identical covariance matrices.
▪ Cannot be applied to an unlabelled dataset.
▪ LDA is sensitive to violations of its assumptions (e.g., non-normal data).
[Link: Linear Discriminant Analysis – Bit by Bit]
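A minimal two-class sketch of the idea, assuming synthetic Gaussian classes with a shared covariance matrix (the class means, covariance and sample sizes are made up). The direction that maximizes between-class separation relative to within-class scatter is computed with Fisher's formula, w ∝ Sw⁻¹(m1 − m0).

```python
import numpy as np

rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])                  # shared covariance (LDA assumption)
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)   # class 0
X1 = rng.multivariate_normal([2.0, 0.0], cov, size=200)   # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter: sum of the two per-class scatter matrices.
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w = np.linalg.solve(Sw, m1 - m0)              # Fisher discriminant direction
w /= np.linalg.norm(w)

z0, z1 = X0 @ w, X1 @ w                       # 1-D projections of each class
threshold = 0.5 * (z0.mean() + z1.mean())
accuracy = 0.5 * ((z0 < threshold).mean() + (z1 >= threshold).mean())
print(w, accuracy)
```

The class labels are needed to compute `m0`, `m1` and `Sw`, which is why the method cannot run on unlabelled data.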
PCA Vs. LDA
Feature Selection
◼ Another way to reduce dimensionality of data is to remove
◼ Redundant attributes
◼ Duplicate much or all of the information contained in
one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax
paid
◼ Irrelevant attributes
◼ Contain no information that is useful for the data
mining task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Feature Engineering
Feature Selection
▪ Some transformations or dimension reduction methods lead to a new predictor set that
has as many or fewer predictors than the original data.
▪ Other transformations, like Basis Expansions, generate more features than the original
data.
▪ The hope is that some of the newly engineered predictors capture a predictive relationship
with the outcome.
▪ For several models (e.g., SVMs, neural networks), predictive performance is degraded
as the number of uninformative predictors increases.
▪ Therefore, there is a genuine need to appropriately select predictors for modelling.
▪ Techniques:
◼ Intrinsic methods (e.g., Lasso, Decision Trees) select features based on their importance in model fitting.
◼ Filter methods (e.g., correlation, mutual information) rank features based on statistical relevance.
◼ Wrapper methods (e.g., RFE, forward selection) iteratively test feature subsets for predictive power.
Feature Selection
Feature Selection Methodologies
◼ Intrinsic (or implicit) methods (e.g., Lasso, Decision Trees)
▪ Have feature selection naturally incorporated with the modelling process.
◼ Filter methods (e.g., correlation, mutual information)
◼ Wrapper methods (e.g., RFE, forward selection)
▪ Filter and wrapper methods work to couple feature selection approaches with
modelling techniques.
Intrinsic Methods
The most seamless and important of the three classes for reducing features.
◼ Tree- and Rule-Based Models
▪ These models split data based on predictors that create more homogeneous partitions.
▪ If a predictor is not used in any split, it is effectively excluded.
▪ Ensembles like random forests may force splits on irrelevant predictors,
leading to over-selection.
◼ Multivariate Adaptive Regression Splines (MARS)
▪ Instead of fitting a single global equation, MARS divides the data into different
regions and fits separate regression equations within each region. This is done
using basis functions (BFs), which help capture non-linearity.
▪ If a predictor is not included in at least one regression, it is excluded from the
model.
◼ Regularization Models
▪ These models use penalties to shrink predictor coefficients.
▪ The lasso method forces some coefficients to exactly zero, effectively
removing irrelevant predictors.
Lasso Regression (L1 Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator)
▪ Lasso Regression is a type of linear regression that incorporates L1 regularization, which helps
in both feature selection and shrinking model coefficients.
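▪ Lasso modifies the ordinary least squares (OLS) objective function by adding an L1 penalty
term: $\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$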
▪ Adds an L1 penalty: Forces some regression coefficients to be exactly zero, effectively
performing automatic feature selection.
▪ Large λ: More coefficients shrink to zero → simpler model with fewer features
▪ Small λ: Less regularization → more features retained
▪ λ = 0: No penalty, so Lasso reduces to ordinary least squares (OLS) regression
▪ Reduces overfitting: Helps improve generalization by preventing overly complex models.
▪ Works well with high-dimensional data: Useful when there are many irrelevant or correlated
predictors.
Coefficients Shrink as λ Increases
▪ Small λ: All features retain nonzero coefficients.
▪ Increasing λ: Coefficients shrink towards zero.
▪ Large λ: Some coefficients become exactly zero, effectively removing those features
from the model.
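A minimal sketch of this behaviour, assuming synthetic data in which only two of five predictors matter. It solves the lasso with a plain coordinate descent loop and soft-thresholding. The objective is scaled as (1/(2n))·||y − Xβ||² + λ·||β||₁, a common convention that only rescales λ. The helper names `soft_threshold` and `lasso_cd` are illustrative, not a library API.

```python
import numpy as np

def soft_threshold(value, lam):
    """Shrink value towards zero by lam; exactly zero when |value| <= lam."""
    return np.sign(value) * max(abs(value) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n))*||y - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)            # standardise predictors
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])    # three irrelevant features
y = X @ true_beta + 0.5 * rng.normal(size=200)
y = y - y.mean()                                    # centre the response

beta_ols = lasso_cd(X, y, 0.0)    # lambda = 0: plain least squares
beta_mid = lasso_cd(X, y, 0.5)    # moderate penalty
beta_big = lasso_cd(X, y, 5.0)    # very large penalty
for name, b in [("lam=0.0", beta_ols), ("lam=0.5", beta_mid), ("lam=5.0", beta_big)]:
    print(name, b.round(3))
```

With no penalty all five coefficients are nonzero. With a moderate penalty the three irrelevant ones are set to exactly zero, and with a very large penalty every coefficient is removed.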
Filter Methods
▪ Filter methods conduct an initial supervised analysis of the predictors to determine
which are important and then only provide these to the model. Some common
examples include:
◼ Correlation-based Feature Selection (CFS) – Selects features with the highest correlation with the
target variable and the lowest correlation with each other.
◼ Chi-Square Test – Measures the independence between categorical features and the target
variable.
◼ Mutual Information – Evaluates how much information a feature contributes to predicting the
target variable.
◼ Variance Thresholding – Removes features with low variance, assuming they provide little
discriminative power.
◼ ANOVA (Analysis of Variance) F-test – Compares the variance of a numerical feature between
the target groups with its variance within the groups to determine feature importance.
◼ Information Gain – Calculates the reduction in entropy when using a feature for classification.
◼ Principal Component Analysis (PCA) – Although primarily a dimensionality reduction technique, it
can also be used for feature selection by keeping the most important components.
▪ A selection of predictors that meets a filtering criterion such as statistical significance may
not be a set that improves predictive performance.
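A minimal sketch of two of the filters listed above, variance thresholding followed by a ranking on absolute correlation with the target. The three synthetic features and the 0.01 variance cutoff are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
informative = rng.normal(size=n)
noise_feature = rng.normal(size=n)
near_constant = 5.0 + 0.001 * rng.normal(size=n)
X = np.column_stack([informative, noise_feature, near_constant])
y = 2.0 * informative + 0.5 * rng.normal(size=n)

# Step 1: variance threshold (0.01 is an arbitrary illustrative cutoff).
variances = X.var(axis=0)
keep = np.flatnonzero(variances > 0.01)

# Step 2: rank the surviving features by |correlation with the target|.
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep])
ranking = keep[np.argsort(scores)[::-1]]
print(keep, ranking)
```

No model is trained at any point, which is what makes filters cheap. It is also why the selected set is not guaranteed to improve predictive performance.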
Wrapper Methods
▪ A Wrapper Method doesn't just look at the features in isolation.
Instead, it "wraps" itself around a specific model and uses the model's
performance as a guide to select the best subset of features.
▪ It tries out different combinations of features, trains a model on each
combination, and keeps the combination that gives the best model
performance.
▪ Common Search Strategies
• Step-wise forward selection
• Step-wise backward elimination
• Combining forward selection and backward elimination
Wrapper Methods - Stepwise forward selection
◼ The procedure starts with an
empty set of attributes as the
reduced set.
◼ First: The best single feature is
picked.
◼ Next: At each subsequent
iteration or step, the best of the
remaining original attributes is
added to the set.
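A minimal sketch of stepwise forward selection, assuming a linear regression model scored by mean squared error on a hold-out split (the model, the score, the data and the helper names are illustrative choices). At each step, the feature whose addition gives the lowest validation error is added.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit OLS with an intercept on the chosen columns; return hold-out MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, n_select):
    selected, remaining = [], list(range(X_tr.shape[1]))   # start with an empty set
    while len(selected) < n_select:
        # Try each remaining feature and keep the one with the lowest error.
        best = min(remaining,
                   key=lambda j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 4] + 0.5 * rng.normal(size=400)   # only 2 and 4 matter
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]

chosen = forward_selection(X_tr, y_tr, X_va, y_va, n_select=2)
print(chosen)
```

The number of features to select is fixed here for simplicity. In practice the loop is often stopped when the validation error no longer improves.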
Wrapper Methods - Stepwise backward elimination
◼ The procedure starts with the
full set of attributes.
◼ At each step, it removes the
worst attribute remaining in
the set.
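A matching sketch of stepwise backward elimination, under the same assumptions as before (linear regression scored by hold-out mean squared error, with illustrative data and helper names). Starting from all features, each step drops the feature whose removal leaves the lowest validation error.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit OLS with an intercept on the chosen columns; return hold-out MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def backward_elimination(X_tr, y_tr, X_va, y_va, n_keep):
    selected = list(range(X_tr.shape[1]))          # start with the full set
    while len(selected) > n_keep:
        # The "worst" feature is the one we can remove at the lowest cost.
        worst = min(selected,
                    key=lambda j: validation_mse(X_tr, y_tr, X_va, y_va,
                                                 [k for k in selected if k != j]))
        selected.remove(worst)
    return selected

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 6))
y = 4.0 * X[:, 2] - 3.0 * X[:, 4] + 0.5 * rng.normal(size=400)   # only 2 and 4 matter
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]

kept = backward_elimination(X_tr, y_tr, X_va, y_va, n_keep=2)
print(kept)
```

Backward elimination trains more models on large feature sets than forward selection does, so it tends to be the more expensive of the two when there are many attributes.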
Combining forward selection and backward
elimination
◼ The stepwise forward selection and backward elimination
methods can be combined
◼ At each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Hybrid Feature Selection Methods
▪ Hybrid methods combine the strengths of Filter and Wrapper
approaches to achieve better feature selection while balancing
efficiency and accuracy. They generally follow a two-step process:
1. Pre-selection using Filters
◼ The dataset is first filtered using statistical techniques (e.g., correlation, mutual
information, Chi-square test) to remove irrelevant or redundant features.
◼ This step significantly reduces the feature space, making the subsequent steps
computationally efficient.
2. Refinement using Wrappers
◼ The filtered features are then evaluated using Wrapper methods
◼ This ensures that the selected feature subset is optimal for the specific predictive
model.
EXTRA SLIDES
Regression Analysis
◼ The fitted model is the best-fitting line through the data: the line that achieves the
minimum sum of squared residuals. It starts from the intercept and rises or falls
according to the slope.
[Figure: actual salary vs. predicted salary; each residual is the gap between an actual value and the fitted line]
◼ The method used to determine the best-fitting line is called Ordinary
Least Squares (OLS). There are other methods, but this is the simplest
one.
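A minimal sketch of OLS for one independent variable, using the closed-form slope and intercept. The experience and salary values are made up for illustration.

```python
import numpy as np

# Hypothetical data: years of experience vs. salary (in thousands).
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([32.0, 35.0, 41.0, 44.0, 48.0])

x_mean, y_mean = years.mean(), salary.mean()
slope = np.sum((years - x_mean) * (salary - y_mean)) / np.sum((years - x_mean) ** 2)
intercept = y_mean - slope * x_mean

predicted = intercept + slope * years     # points on the fitted line
residuals = salary - predicted            # actual minus predicted
sse = np.sum(residuals ** 2)              # the quantity OLS minimises
print(slope, intercept, sse)
```

Any other line through these points would give a larger sum of squared residuals than `sse`.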
Regression Analysis
◼ Simple linear regression
◼ Where we have only one
independent variable
◼ The fitted line is a straight line since
the model is of 1st order.
◼ Multiple regression
◼ When having several independent
variables that affect the prediction of
the dependent variable
◼ Allows a response variable Y to be
modeled as a linear function of
a multidimensional feature vector.
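A minimal sketch of multiple regression, assuming three synthetic predictors and made-up true coefficients. The response is modeled as an intercept plus a weighted sum of the features, and the weights are estimated by least squares.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))                      # three hypothetical predictors
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

A = np.column_stack([np.ones(len(X)), X])          # prepend a column of ones for the intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, weights = coef[0], coef[1:]
print(intercept, weights)
```

The estimated intercept and weights should land close to the values used to generate the data, up to the added noise.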