Feature Engineering in Machine Learning

The document discusses feature engineering, emphasizing its importance in machine learning as it involves feature construction, selection, and transformation to enhance model effectiveness. It outlines the processes of transforming structured and unstructured data into meaningful features, addressing issues like high dimensionality and feature relevance. Additionally, it covers techniques such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) for dimensionality reduction and feature extraction.

Uploaded by

mukeshbiknalikar05

UNIT – II:

Feature Engineering: feature, feature engineering. Feature transformation: feature construction, feature extraction. Feature subset selection: issues in high-dimensional data, key drivers of feature selection (feature relevance and redundancy), measures of feature relevance and redundancy, overall feature selection process, feature selection approaches.

Objectives

• Feature engineering is a critical allied task that we need to perform to make learning more effective.
• It has three key components: feature construction, feature selection, and feature transformation.

INTRODUCTION

• Modelling alone does not let us realize the effectiveness of machine learning as a problem-solving tool.
• If a specific model is not effective, we can use different levers to boost its effectiveness.
• Before applying machine learning to solve a problem, certain preparatory steps are needed.
• The key aspect, which plays a critical role in solving any machine learning problem, is feature engineering.
• It is a critical preparatory step that deals with the features of the data set, which form an important input to any machine learning problem, be it supervised or unsupervised learning.
• It is responsible for taking raw input data and converting it into well-aligned features that are ready to be used by machine learning models.

Structured vs Unstructured data

• Structured data is highly specific and is stored in a predefined format.
• Unstructured data is raw, unorganized data which does not follow a specific format or hierarchy.
  Examples: text data from social networks (e.g. Twitter, Facebook) or data from server logs.

Feature:

• A feature is an attribute of a data set that is used in a machine learning process.
• Only the attributes which are meaningful to the machine learning problem are called features.
• Selecting the subset of features which are meaningful for machine learning is a sub-area of feature engineering which draws a lot of research interest.
• The features in a data set are also called its dimensions. So a data set having 'n' features is called an n-dimensional data set.

Example: Iris dataset

• It has five attributes or features, namely Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.
• 'Species' represents the class variable; the other four features are the predictor variables.

[Figure: Iris dataset]

Feature engineering

• Feature engineering is the process of translating a data set into features such that these features represent the data set more effectively and result in better learning performance.

It has two major elements:

1. Feature transformation

• It transforms the data (structured or unstructured) into a new set of features.
• There are two variants of feature transformation:
  a. feature construction
  b. feature extraction

Both are sometimes known as feature discovery.

a. The feature construction process discovers missing information about the relationships between features and augments the feature space by creating additional features.

• If there are 'n' features or dimensions in a data set, after feature construction 'm' more features or dimensions may be added.

b. Feature extraction is the process of extracting or creating a new set of features from the original set of features using some functional mapping.

2. Feature subset selection

• Unlike feature transformation, no new feature is generated here.
• The objective of feature selection is to derive a subset of features from the full feature set which is most meaningful in the context of a specific machine learning problem.
• So, essentially, the job of feature selection is to derive a subset Fj (F1, F2, …, Fm) of Fi (F1, F2, …, Fn), where m < n, such that Fj is the most meaningful subset for the machine learning problem.

4.2 FEATURE TRANSFORMATION

• Feature transformation, done as part of data preprocessing, is an important prerequisite for the success of any machine learning model.
• Initially all available attributes of the data set are used as features, and the important features are then identified based on the model.
• The data will contain features with different magnitudes, and we have to scale the different features to the same range of magnitude.
• Otherwise, most algorithms will give more importance to the features with large magnitudes rather than giving the same importance to all features.
• This leads to wrong predictions and faulty models.
• Suppose a model has to be trained to classify a document as spam or not spam, with each document represented as a bag of words.
• The feature space then contains all unique words occurring across all documents, which leads to a feature space of a few hundred thousand features.
• If we start including bigrams or trigrams along with words, the count of features will run into millions.
  (An 'n-gram' is a contiguous set of n items, for example words in a text block or document, used in Natural Language Processing.)
• To deal with this problem, feature transformation is used as an effective tool for dimensionality reduction.
• Broadly, there are two distinct goals of feature transformation:
  • achieving the best reconstruction of the original features in the data set
  • achieving the highest efficiency in the learning task

Feature construction

Feature construction transforms a given set of input features to generate a new set of more powerful features.

Example:
Data set: real estate data set
Attributes: apartment length, apartment breadth, and price of the apartment.

It is more convenient, and makes more sense, to use the area of the apartment instead of its length and breadth.

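The apartment example can be sketched in a few lines of Python. This is only an illustration: the record layout, the function name `add_area` and the sample values are assumptions made for the sketch, not data from the slides.

```python
# Hypothetical apartment records; the numbers are illustrative only.
apartments = [
    {"length": 10.0, "breadth": 8.0, "price": 120000},
    {"length": 12.0, "breadth": 9.5, "price": 165000},
    {"length": 7.5, "breadth": 6.0, "price": 80000},
]


def add_area(records):
    """Construct an 'area' feature that replaces length and breadth."""
    constructed = []
    for rec in records:
        constructed.append({
            "area": rec["length"] * rec["breadth"],  # the constructed feature
            "price": rec["price"],                   # the target is kept as-is
        })
    return constructed


area_records = add_area(apartments)
for rec in area_records:
    print(rec)
```

The two original dimensions collapse into one constructed feature, which is usually the quantity that actually relates to the price.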
Situations where feature construction is an essential activity:

1. when features have categorical values and machine learning needs numeric inputs
2. when features have numeric (continuous) values and need to be converted to ordinal values
3. when text-specific feature construction needs to be done

1. Encoding categorical (nominal) variables

Athletes data set:

• Most machine learning algorithms require numerical values to learn from.
• The data set has three features (City of origin, Parents athlete, and Chance of win) which are categorical in nature and cannot be used directly by a machine learning task.
• Feature construction can be used to create new dummy features, one per category value, that hold 0 or 1.

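Creating dummy features for a nominal variable can be sketched as below. The helper name `one_hot_encode` and the city values are hypothetical; the slides name the feature "City of origin" but the values used here are placeholders.

```python
def one_hot_encode(values):
    """Return (column_names, rows) with one 0/1 dummy column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical values for the "City of origin" feature.
city_of_origin = ["City A", "City B", "City C", "City A"]
columns, encoded = one_hot_encode(city_of_origin)

print(columns)
for row in encoded:
    print(row)
```

Each row has exactly one 1, so no artificial ordering is imposed on the categories.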
Encoding categorical (ordinal) variables

The grade is an ordinal variable with values A, B, C, and D.

2. Transforming numeric (continuous) features to categorical features

• Sometimes a continuous numerical variable needs to be transformed into a categorical variable.
• For example, we may want to recast real estate price prediction, which is a regression problem, as real estate price category prediction, which is a classification problem.
• To do so, we 'bin' the numerical data into multiple categories based on the data range.

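A minimal sketch of binning follows. The bin boundaries, the category names and the sample prices are assumed values chosen for illustration; the slides do not specify them.

```python
def bin_price(price, boundaries, labels):
    """Return the label of the first bin whose upper boundary exceeds the price."""
    for upper, label in zip(boundaries, labels):
        if price < upper:
            return label
    return labels[-1]  # at or above the last boundary


# Hypothetical bin edges and category names.
BOUNDARIES = [100000, 200000]
LABELS = ["Low", "Medium", "High"]

prices = [85000, 150000, 320000, 200000]
categories = [bin_price(p, BOUNDARIES, LABELS) for p in prices]
print(categories)
```

Where the boundaries are placed is a design decision; equal-width and equal-frequency bins are two common choices.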
3. Text-specific feature construction

• In the current world, text is arguably the most predominant medium of communication.
• Whether in social networks like Facebook, micro-blogging channels like Twitter, emails, or messaging services such as SMS and WhatsApp, text plays a major role in the flow of information.
• Hence, text mining is an important area of research, not only for technology practitioners but also for industry practitioners.
• Text data is unstructured and does not have readily available (straightforward) features like structured data.
• All ML algorithms need numerical data as input, so the text data in the data set needs to be transformed.

Vectorization

• The process of converting text data (or a corpus, which is the more popular term) into a numerical (vector) representation is known as vectorization.
• In this process, word occurrences in all documents belonging to the corpus are consolidated in the form of a bag-of-words.

There are three major steps that are followed:

1. tokenize
2. count
3. normalize

• First, the corpus is tokenized: blank spaces and punctuation are used as delimiters to separate out the words, or tokens.
• Then the number of occurrences of each token is counted for each document.
• Lastly, tokens are weighted with reduced importance when they occur in the majority of the documents.
• A matrix is then formed with each token representing a column and each document of the corpus representing a row.
• Each cell contains the count of occurrences of the token in a specific document. This matrix is known as a document-term matrix.

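The three steps above can be sketched with the Python standard library. The corpus is made up for illustration, and the normalization shown, an inverse-document-frequency weight of log(N / document frequency), is one common choice assumed here rather than the specific weighting the slides have in mind.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: split on blank spaces and punctuation (any non-alphanumeric run)."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def document_term_matrix(corpus):
    """Step 2: count tokens; one row per document, one column per token."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[tok] for tok in vocabulary] for c in counts]
    return vocabulary, matrix


def idf_weight(matrix):
    """Step 3: down-weight tokens that occur in many documents."""
    n_docs, n_terms = len(matrix), len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
            for row in matrix]


# Hypothetical three-document corpus.
corpus = ["Free offer, free prize!", "Meeting at noon.", "Free meeting offer"]
vocab, dtm = document_term_matrix(corpus)
weighted = idf_weight(dtm)
print(vocab)
print(dtm)
```

Most cells of the matrix are zero, which is why document-term matrices are usually stored in a sparse format for real corpora.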
Feature extraction

• New features are created from a combination of the original features.
• Some of the commonly used operators for combining the original features include:

1. Boolean features: conjunction, disjunction, negation, etc.
2. Nominal features: Cartesian product, M of N, etc.
3. Numerical features: min, max, addition, subtraction, multiplication, division, average, equivalence, inequality, etc.

Definition:
• Take a data set with a feature set Fi (F1, F2, …, Fn).
• After feature extraction using a mapping function f(F1, F2, …, Fn), we will have a new set of features.

Example (each new feature is a weighted combination of original feature values):
Feat1 = 0.3 × 34 + 0.9 × 34.5 = 41.25
Feat2 = 34.5 + 0.5 × 23 + 0.6 × 233 = 185.80

The most popular feature extraction algorithms used in machine learning:

1. Principal Component Analysis (PCA)
2. Singular Value Decomposition (SVD)
3. Linear Discriminant Analysis (LDA)

Principal Component Analysis (PCA)

• PCA is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
• A data set can have multiple attributes, many of which might be similar to each other.
• Machine learning algorithms tend to perform better when the number of features is reduced and the similarity between the features is low.
• PCA is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation.
• These new transformed features are called the principal components.

Example:
If height is greater, weight is generally greater too, and vice versa. So if a data set has height and weight as two of its attributes, the two are expected to be quite similar.

• In PCA, a new set of features is extracted from the original features, and the new features are quite dissimilar in nature.
• So an n-dimensional feature space gets transformed to an m-dimensional feature space, where the dimensions are orthogonal to each other (i.e. uncorrelated).

Basis vector:

• A vector is a quantity having both magnitude and direction, and hence can determine the position of a point relative to another point in Euclidean space.
• A vector space is a set of vectors.
• Vector spaces have the property that every vector in them can be represented as a linear combination of a smaller set of vectors, called basis vectors.
• So, any vector 'v' in a vector space can be represented as

$$v = \sum_{i=1}^{n} a_i u_i = a_1 u_1 + a_2 u_2 + \cdots + a_n u_n$$

where $a_1, \ldots, a_n$ are 'n' scalars and $u_1, \ldots, u_n$ are the basis vectors.

• Orthogonality of vectors in an n-dimensional vector space can be thought of as an extension of vectors being perpendicular in a two-dimensional vector space (i.e. completely unrelated to, or independent of, each other).
• Each vector in the original set can be expressed as a linear combination of the basis vectors.
• This helps in decomposing the vectors into a number of independent components.

The objective of PCA is to make the transformation in such a way that:

1. The new features are distinct, i.e. the covariance between the new features (the principal components) is 0.
2. The principal components are generated in order of the variability in the data that they capture.
   • The first principal component should capture the maximum variability.
   • The second principal component should capture the next highest variability, and so on.
3. The sum of the variances of the new features (the principal components) should be equal to the sum of the variances of the original features.

• PCA works based on a process called eigenvalue decomposition of the covariance matrix of a data set.

Below are the steps to be followed:

1. Calculate the covariance matrix of the data set.
2. Calculate the eigenvalues and eigenvectors of the covariance matrix.
3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. That is PC1.
4. The eigenvector having the next highest eigenvalue represents PC2.
5. In this way, identify the top 'k' eigenvectors having the top 'k' eigenvalues so as to get the 'k' principal components.
6. Derive the new data set by projecting the data onto these 'k' principal components.
PCA Numerical Example

Using PCA to reduce the dimension from 2 to 1:

X            Y
4            11
8            4
13           5
7            14
Total = 32   Total = 34

Dr. mohammed Alahmed
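The PCA steps can be traced on the four (X, Y) points of the numerical example with a short Python sketch. It assumes the sample covariance (division by n − 1) and uses the closed-form eigenvalues of a 2 × 2 symmetric matrix, so it only covers the two-feature case; the function names are illustrative. The sign of an eigenvector is arbitrary, so a hand calculation may give the same direction with opposite signs.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (divides by n - 1)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def pca_2d_to_1d(xs, ys):
    """Project two-feature data onto its first principal component."""
    # Step 1: covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Step 2: eigenvalues from lambda^2 - trace * lambda + det = 0
    trace, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(trace * trace - 4 * det)
    lam1, lam2 = (trace + root) / 2, (trace - root) / 2
    # Step 3: unit eigenvector of the largest eigenvalue (assumes sxy != 0)
    vx, vy = sxy, lam1 - sxx
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)
    # Step 6: new data set = mean-centred points projected onto PC1
    mx, my = mean(xs), mean(ys)
    projected = [(x - mx) * pc1[0] + (y - my) * pc1[1] for x, y in zip(xs, ys)]
    return (lam1, lam2), pc1, projected


X = [4, 8, 13, 7]
Y = [11, 4, 5, 14]
eigenvalues, pc1, new_feature = pca_2d_to_1d(X, Y)
print(eigenvalues)
print(pc1)
print(new_feature)
```

The two eigenvalues add up to the total variance of X and Y, and the variance of the projected feature equals the larger eigenvalue, which matches objectives 2 and 3 listed earlier.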
Singular value decomposition

• Singular value decomposition (SVD) is a matrix factorization technique commonly used in linear algebra.
• The SVD of a matrix A (m × n) is a factorization of the form:

$$A = U \Sigma V^T$$

where U and V are orthonormal matrices.

• U is an m × m unitary matrix, V is an n × n unitary matrix, and Σ is an m × n rectangular diagonal matrix.
• The diagonal entries of Σ are known as the singular values of matrix A.
• The columns of U and V are called the left-singular and right-singular vectors of matrix A, respectively.

The SVD of a data matrix is expected to have the properties highlighted below:

1. Patterns in the attributes are captured by the right-singular vectors, i.e. the columns of V.
2. Patterns among the instances are captured by the left-singular vectors, i.e. the columns of U.
3. The larger a singular value, the larger the part of the matrix A that it and its associated vectors account for.
4. A new data matrix with 'k' attributes is obtained using the equation

$$D' = D \times [v_1, v_2, \ldots, v_k]$$

Thus, the dimensionality gets reduced to k. SVD is often used in the context of text data.

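Problem 1:
Let the matrix A be

$$A = \begin{bmatrix} 1 & -1 \\ -2 & 2 \\ 2 & 2 \end{bmatrix}$$

Step 1: Find $A^T A$.

$$A^T A = \begin{bmatrix} 9 & -1 \\ -1 & 9 \end{bmatrix}$$

Step 2: Find the eigenvalues of $A^T A$.
The characteristic equation of $A^T A$ is $|A^T A - \lambda I| = 0$:

$$\begin{vmatrix} 9-\lambda & -1 \\ -1 & 9-\lambda \end{vmatrix} = 0$$

For a 2 × 2 matrix this is $\lambda^2 - (\text{sum of diagonal elements of } A^T A)\,\lambda + \det(A^T A) = 0$, i.e. $\lambda^2 - 18\lambda + 80 = 0$, which gives the eigenvalues $\lambda_1 = 10$ and $\lambda_2 = 8$.

Step 3: Find the eigenvectors from $(A^T A - \lambda I)X = 0$.
Substituting $\lambda_1 = 10$ into the characteristic matrix:

$$\begin{bmatrix} 9-10 & -1 \\ -1 & 9-10 \end{bmatrix} = \begin{bmatrix} -1 & -1 \\ -1 & -1 \end{bmatrix}, \qquad -x - y = 0 \;\Rightarrow\; x = 1,\ y = -1, \qquad X_1 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$

Similarly, $\lambda_2 = 8$ gives $x - y = 0$, so $X_2 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$.

Step 4: Find the normalized eigenvectors.

$$v_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}, \qquad v_2 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad V = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}$$

Step 5: The singular values of A are

$$\sigma_1 = \sqrt{\lambda_1} = \sqrt{10}, \qquad \sigma_2 = \sqrt{\lambda_2} = \sqrt{8} = 2\sqrt{2}$$

Step 6: Now we will find U, using $u_1 = A v_1 / \sigma_1$ and $u_2 = A v_2 / \sigma_2$.

$$A v_1 = \begin{bmatrix} 1 & -1 \\ -2 & 2 \\ 2 & 2 \end{bmatrix} \cdot \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} \sqrt{2} \\ -2\sqrt{2} \\ 0 \end{bmatrix}, \qquad u_1 = \begin{bmatrix} \tfrac{1}{\sqrt{5}} \\ -\tfrac{2}{\sqrt{5}} \\ 0 \end{bmatrix}, \qquad u_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

$u_3$ should be orthogonal to both $u_1$ and $u_2$:

$$u_3 \cdot u_1 = 0, \qquad u_3 \cdot u_2 = 0 \;\Rightarrow\; u_3 = \begin{bmatrix} \tfrac{2}{\sqrt{5}} \\ \tfrac{1}{\sqrt{5}} \\ 0 \end{bmatrix}$$

Step 7: Now combine $u_1$, $u_2$, $u_3$ and write the complete U matrix. Similarly write Σ and $V^T$.

$$U = \begin{bmatrix} \tfrac{1}{\sqrt{5}} & 0 & \tfrac{2}{\sqrt{5}} \\ -\tfrac{2}{\sqrt{5}} & 0 & \tfrac{1}{\sqrt{5}} \\ 0 & 1 & 0 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sqrt{10} & 0 \\ 0 & 2\sqrt{2} \\ 0 & 0 \end{bmatrix}, \qquad V^T = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}$$
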
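The hand calculation for Problem 1 can be checked with a small Python sketch that follows the same route through the eigenvalues of AᵀA. It is written for this 3 × 2 matrix only and uses the closed-form eigenvalues of a 2 × 2 symmetric matrix; the helper names are illustrative. Eigenvector signs are arbitrary, so the vectors may come out as the negatives of the hand-derived ones while the product still reproduces A.

```python
import math

A = [[1, -1], [-2, 2], [2, 2]]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]


# Step 1: A^T A (a symmetric 2 x 2 matrix)
ata = matmul(transpose(A), A)

# Step 2: eigenvalues from lambda^2 - trace * lambda + det = 0
trace = ata[0][0] + ata[1][1]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
root = math.sqrt(trace * trace - 4 * det)
eigenvalues = [(trace + root) / 2, (trace - root) / 2]


# Steps 3-4: unit eigenvectors (assumes the off-diagonal entry is non-zero)
def unit_eigenvector(lam):
    vx, vy = ata[0][1], lam - ata[0][0]
    n = math.hypot(vx, vy)
    return [vx / n, vy / n]


V = [unit_eigenvector(lam) for lam in eigenvalues]  # V[i] is v_(i+1)

# Step 5: singular values are the square roots of the eigenvalues
sigma = [math.sqrt(lam) for lam in eigenvalues]

# Step 6: u_i = A v_i / sigma_i
U = [[sum(a * v for a, v in zip(row, V[i])) / sigma[i] for row in A]
     for i in range(2)]

print(ata, eigenvalues, sigma)
print(U)
```

Only u1 and u2 are computed because Σ has two non-zero singular values; u3 merely completes U to a square orthonormal matrix and does not affect the product U Σ Vᵀ.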
Linear Discriminant Analysis

• The objective of LDA is similar to that of PCA in the sense that it intends to transform a data set into a lower-dimensional feature space.
• However, unlike PCA, the focus of LDA is not to capture the data set variability.
• Instead, LDA focuses on class separability, i.e. separating the features based on class separability so as to avoid over-fitting of the machine learning model.
• PCA calculates eigenvalues of the covariance matrix of the data set, whereas LDA calculates eigenvalues and eigenvectors of the intra-class and inter-class scatter matrices.

Below are the steps to be followed:

1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices.
3. Calculate the eigenvalues and eigenvectors for $S_W^{-1} S_B$, where $S_W$ is the intra-class scatter matrix and $S_B$ is the inter-class scatter matrix.

$$S_W = \sum_{i=1}^{c} S_i, \qquad S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$

where $m_i$ is the mean vector of the i-th class, c is the number of classes and $D_i$ is the set of samples in the i-th class.

$$S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T$$

where $m_i$ is the sample mean for each class, m is the overall mean of the data set, and $N_i$ is the sample size of each class.

4.3 FEATURE SUBSET SELECTION

Feature subset selection picks the subset of attributes or features which makes the most meaningful contribution to a machine learning activity.

Example:
• The student weight data set has features such as Roll Number, Age, Height, and Weight.
• We can well understand that roll number has no bearing whatsoever on predicting student weight.
• So we can eliminate the feature Roll Number and build a feature subset to be considered in this machine learning problem.

The subset of features is expected to give better results than the full set.

Issues in high-dimensional data

• With the rapid innovations in the digital space, the volume of data generated has increased to an unbelievable extent.
• At the same time, breakthroughs in storage technology have made storing large quantities of data quite cheap.
• This has further motivated the storage and mining of very large and high-dimensional data sets.

Examples: DNA analysis, geographic information systems (GIS), social networking, etc.

• Two application domains in particular have seen drastic growth: biomedical research and text data.
• Biomedical research includes gene selection from microarray data, which generates data sets having a number of features in the range of a few tens of thousands.
• Text data generated from different sources, such as social networking sites, emails, messages and articles, also has extremely high dimensionality.
• In a large document corpus of a few thousand documents, the number of unique word tokens, which represent the features of the text data set, can also be in the range of a few tens of thousands.
• Such high-dimensional data may be a big challenge for any machine learning algorithm.

Problems with high-dimensional data:

• A very high quantity of computational resources and a large amount of time will be required.
• The performance of the model, for both supervised and unsupervised machine learning tasks, also degrades sharply due to unnecessary noise in the data.
• A model built on an extremely high number of features may be very difficult to understand.

Hence, it is necessary to take a subset of the features instead of the full set.

The objective of feature selection is three-fold:

• Having a faster and more cost-effective learning model (i.e. one that needs fewer computational resources)
• Improving the efficiency of the learning model
• Having a better understanding of the underlying model that generated the data

4.3.2 Key drivers of feature selection: feature relevance and redundancy

Feature relevance (supervised learning):

• In supervised learning, the input data set, which is the training data set, has a class label attached.
• The model has to assign class labels to new, unlabelled data.
• Each of the predictor variables is expected to contribute information to decide the value of the class label.
• If a variable does not contribute any information, it is said to be irrelevant.
• If its information contribution to the prediction is very small, the variable is said to be weakly relevant.

Feature relevance (unsupervised learning):

• In unsupervised learning, there is no training data set or labelled data.
• Similar data instances are grouped together, and the similarity of data instances is evaluated based on the values of different variables.
• Certain variables do not contribute any useful information for deciding the similarity or dissimilarity of data instances.
• So, those variables make no significant information contribution to the grouping process.
• These variables are marked as irrelevant variables in the context of the unsupervised machine learning task.

Example

• Student data set: to predict the weight of a student (supervised learning), Roll Number does not contribute any significant information.
• To group students with similar academic capabilities (unsupervised learning), Roll Number again cannot contribute any information whatsoever.
• The irrelevant candidates are rejected when selecting a subset of features.
• Whether the weakly relevant features are rejected or not is decided on a case-by-case basis.

Feature redundancy

• A feature may contribute information which is similar to the information contributed by one or more other features.

Example: In the weight prediction of a student, both the features Age and Height contribute similar information.

• With an increase in Age, Weight is expected to increase. With an increase in Height, Weight is also expected to increase. Age and Height also increase with each other.
• So, in the context of the weight prediction problem, Age and Height contribute similar information.
• When one feature is similar to another feature, the feature is said to be potentially redundant in the context of the learning problem.
• All features having potential redundancy are candidates for rejection from the final feature subset.
• Only a small number of representative features become part of the final feature subset.
• The objective of feature selection is to remove all features which are redundant or irrelevant.

4.3.3 Measures of feature relevance and redundancy

4.3.3.1 Measures of feature relevance

4.3.3.2 Measures of feature redundancy

1. Correlation-based measures
2. Distance-based measures
3. Other coefficient-based measures

4.3.3.1 Measures of feature relevance

• Feature relevance is based on the amount of information contributed by a feature.
• For supervised learning, mutual information is considered a good measure of the information contribution of a feature towards deciding the value of the class label.
• The higher the mutual information of a feature, the more relevant that feature is. Mutual information can be calculated as follows:

$$MI(C, f) = H(C) + H(f) - H(C, f)$$

where the marginal entropy of the class is $H(C) = -\sum_{i=1}^{K} p(C_i) \log_2 p(C_i)$, the marginal entropy of the feature is $H(f) = -\sum_{x} p(f = x) \log_2 p(f = x)$, $H(C, f)$ is the joint entropy of the class and the feature, K = number of classes, and C = class variable.

[Figure: Weather data for playing cricket]

• In unsupervised learning, there is no class variable.
• Instead, the entropy of the set of features is calculated with one feature left out at a time, for all the features.
• Then the features are ranked in descending order of the information gain from each feature, and the top 'β' percent of features are selected as relevant features (the value of 'β' is a design parameter of the algorithm).
• The entropy of a feature f is calculated using Shannon's formula below:

$$H(f) = -\sum_{x} p(f = x) \log_2 p(f = x)$$

• Shannon's formula is used only for features that take discrete values. For continuous features, discretization should be performed first to estimate the probabilities p(f = x).
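
The mutual information measure for supervised learning, MI(C, f) = H(C) + H(f) − H(C, f), can be sketched for discrete values as follows. The toy class and feature values are invented for illustration and are not the weather data from the slides; the function names are assumptions of the sketch.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(class_labels, feature):
    """MI(C, f) = H(C) + H(f) - H(C, f) for discrete class and feature values."""
    joint = list(zip(class_labels, feature))  # pairs give the joint distribution
    return entropy(class_labels) + entropy(feature) - entropy(joint)


# Hypothetical toy data: 'outlook' determines the class, 'coin' tells nothing.
play = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rainy", "rainy"]
coin = ["heads", "tails", "heads", "tails"]

mi_outlook = mutual_information(play, outlook)
mi_coin = mutual_information(play, coin)
print(mi_outlook, mi_coin)
```

A feature that fully determines the class reaches the class entropy H(C), while a feature that is independent of the class scores 0 and would be marked irrelevant.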
4.3.3.2 Measures of feature redundancy

Feature redundancy is based on similar information being contributed by multiple features.

There are three types of measures:

1. Correlation-based measures
2. Distance-based measures
3. Other coefficient-based measures

1. Correlation-based similarity measure

• Correlation is a measure of linear dependency between two random variables.
• Pearson's product moment correlation coefficient is one of the most popular and accepted measures of correlation between two random variables.
• For two random feature variables F1 and F2, the Pearson correlation coefficient is defined as:

$$\rho(F_1, F_2) = \frac{\operatorname{cov}(F_1, F_2)}{\sqrt{\operatorname{var}(F_1)\,\operatorname{var}(F_2)}}$$

where $\operatorname{cov}(F_1, F_2) = \sum_i (F_{1i} - \bar{F_1})(F_{2i} - \bar{F_2})$, $\operatorname{var}(F_1) = \sum_i (F_{1i} - \bar{F_1})^2$ and $\operatorname{var}(F_2) = \sum_i (F_{2i} - \bar{F_2})^2$.

• Correlation values range between +1 and –1.
• A correlation of +1 or –1 indicates perfect correlation, i.e. the two features have a perfect linear relationship.
• If the correlation is 0, the features have no linear relationship.
• Generally, for feature selection problems a threshold value is adopted to decide whether two features have adequate similarity or not.

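A minimal sketch of the Pearson correlation coefficient and a threshold-based redundancy check is given below. The student values and the 0.9 threshold are assumptions chosen for illustration; the slides only say that some threshold is adopted.

```python
import math


def pearson_correlation(f1, f2):
    """Pearson's product moment correlation between two equal-length features."""
    m1, m2 = sum(f1) / len(f1), sum(f2) / len(f2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(f1, f2))
    var1 = sum((a - m1) ** 2 for a in f1)
    var2 = sum((b - m2) ** 2 for b in f2)
    return cov / math.sqrt(var1 * var2)


def is_redundant(f1, f2, threshold=0.9):
    """Flag a feature pair as potentially redundant (0.9 is an assumed cut-off)."""
    return abs(pearson_correlation(f1, f2)) >= threshold


# Hypothetical student data.
age = [10, 11, 12, 13, 14]
height = [138, 143, 149, 156, 162]
roll_number = [3, 1, 5, 2, 4]

print(pearson_correlation(age, height), is_redundant(age, height))
print(pearson_correlation(age, roll_number), is_redundant(age, roll_number))
```

Because the sums in the numerator and denominator share the same scaling, it does not matter whether covariance and variance are divided by n, n − 1, or not at all.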
2. Distance-based similarity measure

• The most common distance measure is the Euclidean distance, which, between two features F1 and F2, is calculated as:

$$d(F_1, F_2) = \sqrt{\sum_{i=1}^{n} (F_{1i} - F_{2i})^2}$$

where F1 and F2 are features of an n-dimensional data set.

• Example: a data set has two features, aptitude (F1) and communication (F2), under consideration. The Euclidean distance between the features is calculated using the formula above.
• A more generalized form of the Euclidean distance is the Minkowski distance, measured as

$$d(F_1, F_2) = \left( \sum_{i=1}^{n} |F_{1i} - F_{2i}|^{r} \right)^{1/r}$$

• The Minkowski distance takes the form of the Euclidean distance (L2 norm) when r = 2.
• At r = 1, it takes the form of the Manhattan distance (L1 norm), as shown below:

$$d(F_1, F_2) = \sum_{i=1}^{n} |F_{1i} - F_{2i}|$$

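The relationship between the three distances can be sketched in Python. The aptitude and communication scores are made-up values, since the data table from the slide is not available; the function names are illustrative.

```python
def minkowski_distance(f1, f2, r):
    """Minkowski distance of order r between two equal-length feature vectors."""
    return sum(abs(a - b) ** r for a, b in zip(f1, f2)) ** (1 / r)


def euclidean_distance(f1, f2):
    """Minkowski distance with r = 2 (L2 norm)."""
    return minkowski_distance(f1, f2, 2)


def manhattan_distance(f1, f2):
    """Minkowski distance with r = 1 (L1 norm)."""
    return minkowski_distance(f1, f2, 1)


# Hypothetical scores for the aptitude (F1) and communication (F2) features.
aptitude = [1, 2, 3, 4]
communication = [4, 6, 3, 4]

print(euclidean_distance(aptitude, communication))
print(manhattan_distance(aptitude, communication))
```

A small distance between two features suggests that they carry similar information, which makes one of them a candidate for removal.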
Distance between binary vectors

• The distance between binary vectors is usually calculated as the Hamming distance, i.e. the number of positions at which the two vectors differ.

Example: the Hamming distance between the two vectors 01101011 and 11001001 is 3.

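The Hamming distance example can be reproduced with a few lines of Python; the function name is an illustrative choice.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


distance = hamming_distance("01101011", "11001001")
print(distance)
```

The two vectors differ in the first, third and seventh positions.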
3. Other similarity measures

The Jaccard index (or coefficient) is used as a measure of similarity between two features. The Jaccard distance, a measure of dissimilarity between two features, is the complement of the Jaccard index.

For two features having binary values, the Jaccard index is measured as

$$J = \frac{n_{11}}{n_{01} + n_{10} + n_{11}}$$

where
n11 = number of cases where both features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0

Jaccard distance: d = 1 − J

• Let's consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0).
• Identifying the values of n11, n01 and n10 gives n11 = 2, n01 = 1 and n10 = 2.
• The cases where both values are 0 are left out, as they are excluded from the calculation of the Jaccard coefficient.

Jaccard coefficient of F1 and F2:

$$J = \frac{2}{1 + 2 + 2} = 0.4, \qquad d = 1 - J = 0.6$$

The simple matching coefficient (SMC) is almost the same as the Jaccard coefficient, except that it also includes the cases where both features have a value of 0.

$$SMC = \frac{n_{11} + n_{00}}{n_{00} + n_{01} + n_{10} + n_{11}}$$

where
n11 = number of cases where both features have value 1
n01 = number of cases where feature 1 has value 0 and feature 2 has value 1
n10 = number of cases where feature 1 has value 1 and feature 2 has value 0
n00 = number of cases where both features have value 0

Quite understandably, the total count of rows is n = n00 + n01 + n10 + n11. All values are included in the calculation of the SMC.

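The Jaccard index and the simple matching coefficient can be computed for the two binary features F1 and F2 from the example with the sketch below; the function names are illustrative.

```python
def match_counts(f1, f2):
    """Return (n11, n10, n01, n00) for two equal-length binary features."""
    pairs = list(zip(f1, f2))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    return n11, n10, n01, n00


def jaccard_index(f1, f2):
    """Share of 1-1 matches among all cases that are not 0-0 matches."""
    n11, n10, n01, _ = match_counts(f1, f2)
    return n11 / (n01 + n10 + n11)


def simple_matching_coefficient(f1, f2):
    """Share of all matches (1-1 and 0-0) among all cases."""
    n11, n10, n01, n00 = match_counts(f1, f2)
    return (n11 + n00) / (n00 + n01 + n10 + n11)


F1 = (0, 1, 1, 0, 1, 0, 1, 0)
F2 = (1, 1, 0, 0, 1, 0, 0, 0)

print(match_counts(F1, F2))
print(jaccard_index(F1, F2), 1 - jaccard_index(F1, F2))
print(simple_matching_coefficient(F1, F2))
```

The SMC is higher than the Jaccard index here because the three 0-0 matches count as agreement in the SMC but are ignored by the Jaccard index.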
Cosine similarity

• Cosine similarity is one of the most popular measures in text classification and is calculated as

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where $x \cdot y$ is the vector dot product of x and y, $x \cdot y = \sum_{i=1}^{n} x_i y_i$, and $\lVert x \rVert = \sqrt{\sum_{i=1}^{n} x_i^2}$, $\lVert y \rVert = \sqrt{\sum_{i=1}^{n} y_i^2}$.

• The text corpus first needs to be transformed into features, with each word token being a feature and the number of times the word occurs in a document being the value in each row.
• The data set is sparse in nature, as only a few words appear in any one document and hence in any one row of the data set.
• So each row has very few non-zero values. However, the non-zero values can be any integer, as the same word may occur any number of times.
• Also, considering the sparsity of the data set, the 0-0 matches (which are obviously going to be numerous) need to be ignored.

Let's calculate the cosine similarity of x and y, where

x = (2, 4, 0, 0, 2, 1, 3, 0, 0) and
y = (2, 1, 0, 0, 3, 2, 1, 0, 1).

x · y = 2×2 + 4×1 + 0×0 + 0×0 + 2×3 + 1×2 + 3×1 + 0×0 + 0×1 = 19

‖x‖ = √34 ≈ 5.83, ‖y‖ = √20 ≈ 4.47, so cos(x, y) = 19 / (5.83 × 4.47) ≈ 0.729

• Cosine similarity actually measures the angle between the x and y vectors.
• Hence, if the cosine similarity has a value of 1, the angle between x and y is 0°, which means x and y are the same except for their magnitude.
• If the cosine similarity is 0, the angle between x and y is 90°.
• In that case they do not share any similarity (for text data, no term or word is common).
• In the above example, the angle comes to 43.2°.

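The cosine similarity of the x and y vectors from the worked example, and the corresponding angle, can be checked with this short sketch; the function name is illustrative.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = (2, 4, 0, 0, 2, 1, 3, 0, 0)
y = (2, 1, 0, 0, 3, 2, 1, 0, 1)

similarity = cosine_similarity(x, y)
angle_degrees = math.degrees(math.acos(similarity))
print(round(similarity, 4), round(angle_degrees, 1))
```

Positions where both vectors are 0 add nothing to the dot product or to either length, so the 0-0 matches are ignored automatically.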
Exercise: Two rows in a document-term matrix have the values (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine similarity.

4.3.4 Overall feature selection process

Feature selection is the process of selecting a subset of features in a data set. It consists of four steps:

1. Generation of possible subsets
2. Subset evaluation
3. Stopping the search based on some stopping criterion
4. Validation of the result

Reasons for feature selection

1. Simpler model
2. Shorter training time
3. Avoiding the curse of dimensionality
4. Reduced overfitting

Subset generation

• Subset generation is a search procedure which ideally should produce all possible candidate subsets.
• In practice, different approximate search strategies are employed to find candidate subsets for evaluation.
• The search may start with an empty set and keep adding features: sequential forward selection.
• The search may start with the full set and successively remove features: backward elimination.
• In certain cases, the search starts from both ends and adds and removes features simultaneously: bi-directional selection.
• Each candidate subset is then evaluated and compared with the previous best-performing subset. If the new subset performs better, it replaces the previous one.

This cycle of subset generation and evaluation continues until a pre-defined stopping criterion is fulfilled.

Some commonly used stopping criteria are:

1. the search completes
2. some given bound (e.g. a specified number of iterations) is reached
3. subsequent addition (or deletion) of a feature does not produce a better subset
4. a sufficiently good subset (e.g. a subset having better classification accuracy than the existing benchmark) is selected

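The generate, evaluate and stop cycle can be sketched for sequential forward selection as below. The evaluation function `toy_score`, its usefulness values and its penalties are entirely hypothetical stand-ins: in practice the score would come from a statistical measure or from the accuracy of a trained model.

```python
def forward_selection(features, score, max_features=None):
    """Greedy sequential forward selection.

    `score` is any callable that rates a list of feature names (higher is better).
    """
    selected, best_score = [], float("-inf")
    remaining = list(features)
    limit = max_features if max_features is not None else len(remaining)
    while remaining and len(selected) < limit:
        # Subset generation: try adding each remaining feature in turn.
        candidate_score, candidate = max(
            (score(selected + [f]), f) for f in remaining
        )
        # Stopping criterion: no addition produces a better subset.
        if candidate_score <= best_score:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Hypothetical evaluation function standing in for a real subset evaluation.
USEFULNESS = {"Age": 0.3, "Height": 0.5, "Roll Number": 0.0}


def toy_score(subset):
    gain = sum(USEFULNESS[f] for f in subset)
    redundancy = 0.35 if {"Age", "Height"} <= set(subset) else 0.0
    return gain - redundancy - 0.01 * len(subset)  # small cost per feature


chosen, chosen_score = forward_selection(["Roll Number", "Age", "Height"], toy_score)
print(chosen, chosen_score)
```

With these made-up scores the search keeps Height, then stops because adding the redundant Age or the irrelevant Roll Number does not improve the subset. Backward elimination follows the same loop but starts from the full set and removes one feature at a time.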
Validation

• The selected best subset is validated either against prior benchmarks or by experiments using real-life or synthetic but authentic data sets.
• In supervised learning, the accuracy of the learning model may be the performance parameter considered for validation.
• The accuracy of the model using the derived subset is compared against the accuracy of a model using a subset derived with some other benchmark algorithm.
• In unsupervised learning, the cluster quality may be the parameter for validation.

4.3.5 Feature selection approaches

There are four types of approach for feature selection:

1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach

Filter approach

• In the filter approach, the feature subset is selected based on statistical measures that assess the merits of the features from the data perspective.
• No learning algorithm is employed to evaluate the goodness of the features selected.
• Some of the common statistical tests conducted on features as part of the filter approach are Pearson's correlation, information gain, Fisher score, analysis of variance (ANOVA), Chi-square, etc.

Wrapper approach

• The best feature subset is identified using the induction algorithm (the learning algorithm) as a black box.
• The feature selection algorithm searches for a good feature subset using the induction algorithm itself as part of the evaluation function.
• For every candidate subset, the learning model is trained and the result is evaluated by running the learning algorithm.
• The wrapper approach is therefore computationally very expensive.
• Its performance is generally superior to that of the filter approach.

Examples:
• Forward feature selection
• Backward elimination
• Bi-directional selection

Hybrid approach

• The hybrid approach takes advantage of both the filter and wrapper approaches.
• A typical hybrid algorithm uses statistical tests, as in the filter approach, to decide the best subsets for a given cardinality, and a learning algorithm to select the final best subset among the best subsets across different cardinalities.

Embedded approach

 It is quite similar to the wrapper approach. However, the difference is that it performs feature selection (over different feature combinations) and classification simultaneously, as part of training the model.
 Example: Random Forest algorithm
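
To illustrate selection happening inside training, here is a minimal sketch that uses a single decision stump (a one-split tree) rather than a full Random Forest. The stump is a deliberate simplification, and `train_stump`, `predict`, and the data are hypothetical. While fitting, the stump has to choose which feature to split on, so the selected feature falls out of training itself.

```python
def train_stump(X, y):
    """Fit a one-split decision stump on binary labels (0/1).
    Returns (feature_index, threshold, left_label, right_label, errors)."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for left_label in (0, 1):
                right_label = 1 - left_label
                errors = sum(
                    (left_label if row[f] <= t else right_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[4]:
                    best = (f, t, left_label, right_label, errors)
    return best

def predict(stump, row):
    """Classify one row with a fitted stump."""
    f, t, left_label, right_label, _ = stump
    return left_label if row[f] <= t else right_label

# Made-up data: feature 0 is misleading, feature 1 separates the classes.
X = [
    [9.0, 0.0],
    [1.0, 0.25],
    [5.0, 0.5],
    [2.0, 1.0],
    [8.0, 1.25],
    [4.0, 1.5],
]
y = [0, 0, 0, 1, 1, 1]

stump = train_stump(X, y)
print("feature chosen during training:", stump[0])
print([predict(stump, row) for row in X])
```

A Random Forest repeats this kind of split search many times over bootstrapped samples and random feature subsets, and features that are rarely or never chosen for a split are effectively left unused.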
