Feature Engineering in Machine Learning

The document discusses feature engineering in machine learning. It defines what features and feature engineering are, and describes the main components of feature engineering including feature creation, transformations, extraction, and selection. It also discusses different feature selection techniques and how to choose an appropriate feature selection method based on variable types.

Uploaded by

Soumya Mishra

What is Feature?
What is Feature Engineering?
Feature Engineering Processes
Feature Selection

FEATURE ENGINEERING

UNIT III
What is Feature?

In the context of machine learning, a feature (also known as a variable or attribute) is an individual measurable property or characteristic of a data point that is used as an input for a machine learning algorithm. Features can be numerical, categorical, or text-based, and they represent different aspects of the data that are relevant to the problem at hand. For example:

• In a dataset of housing prices, features could include the number of bedrooms, the square footage, the location, and the age of the property.

• In a dataset of customer demographics, features could include age, gender, income level, and occupation.

The choice and quality of features are critical in machine learning, as they can greatly impact the accuracy and performance of the model.

In other words, every machine learning algorithm takes input data to generate its output. The input data is usually in tabular form, consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features.

For example, an image is an instance in computer vision, but a line in the image could be a feature. Similarly, in NLP, a document can be an observation, and the word count could be a feature. So, we can say a feature is an attribute that impacts a problem or is useful for the problem.
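The rows-and-columns view described above can be sketched in a few lines of Python. This is only an illustration: the housing records, the sample documents, and the helper name `word_count` are all invented for the example.

```python
# Hypothetical housing records: each dict is one instance (row),
# and each key is one feature (column).
houses = [
    {"bedrooms": 3, "square_feet": 1400, "location": "suburb", "age_years": 12},
    {"bedrooms": 2, "square_feet": 900, "location": "city", "age_years": 30},
]

# The feature names are simply the column names of the table.
feature_names = sorted(houses[0])


def word_count(document: str) -> int:
    """Return the number of whitespace-separated words in a document."""
    return len(document.split())


# In NLP, each document is an observation and its word count is one feature.
documents = ["feature engineering turns raw data into model inputs", "short text"]
word_count_feature = [word_count(doc) for doc in documents]
```

Here `feature_names` lists the four columns of the housing table, and `word_count_feature` holds one numeric value per document.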
What is Feature Engineering?

Feature engineering is the process of transforming raw data into features that are suitable for machine learning models. In other words, it is the process of selecting, extracting, and transforming the most relevant features from the available data to build more accurate and efficient machine learning models. It is a machine learning technique that leverages data to create new variables that aren’t in the training set. It can produce new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy.

Regardless of the data or architecture, a poor feature will directly affect your model. Feature engineering, in simple terms, is the act of converting raw observations into desired features using statistical or machine learning approaches. It helps to represent the underlying problem to predictive models in a better way, which in turn improves the accuracy of the model on unseen data. A predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.
Feature engineering in ML consists mainly of four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection. These processes are described below:

• Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. New features are created by combining existing features using operations such as addition, subtraction, and ratio, which makes this step very flexible.

• Transformations: The transformation step of feature engineering involves adjusting the predictor variables to improve the accuracy and performance of the model. For example, it lets the model accept a variety of input data, and it puts all the variables on the same scale, making the model easier to understand. It also keeps all the features within an acceptable range to avoid computational errors.

• Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA).

• Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building the model, and the rest of the features are either redundant or irrelevant. If we feed the model a dataset with all these redundant and irrelevant features, it may reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning. "Feature selection is a way of selecting the subset of the most relevant features from the original features set by removing the redundant, irrelevant, or noisy features."
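The feature creation and transformation steps listed above can be illustrated with a minimal sketch. The data, the helper names `add_ratio_feature` and `min_max_scale`, and the choice of min-max scaling as the transformation are assumptions made for this example; other combinations and scaling schemes are equally valid.

```python
def add_ratio_feature(rows, numerator, denominator, new_name):
    """Return copies of the rows with a new feature: numerator / denominator."""
    return [{**row, new_name: row[numerator] / row[denominator]} for row in rows]


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical customer records, invented for illustration.
customers = [
    {"age": 20, "income": 60000, "household_size": 3},
    {"age": 40, "income": 45000, "household_size": 1},
    {"age": 60, "income": 80000, "household_size": 4},
]

# Feature creation: income per person is a ratio of two existing features.
customers = add_ratio_feature(customers, "income", "household_size", "income_per_person")

# Transformation: put age on a 0-to-1 scale so it is comparable with other scaled features.
scaled_age = min_max_scale([c["age"] for c in customers])
```

The ratio adds information that neither original column carries alone, while the scaling changes only the range of an existing feature, not its meaning.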

Feature Selection

Feature Selection is the process of selecting a subset of relevant features from the dataset to be
used in a machine-learning model. It is an important step in the feature engineering process as it
can have a significant impact on the model’s performance.
Selecting the best features helps the model to perform well. For example, suppose we want to create a
model that automatically decides which cars should be crushed for spare parts, and we have a dataset for
this purpose. The dataset contains the model of the car, the year, the owner's name, and the miles driven.
In this dataset, the owner's name does not contribute to the model's performance, as it does not determine
whether the car should be crushed, so we can remove this column and select the remaining features
(columns) for model building.
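The car example above amounts to dropping one column. A minimal sketch, assuming the records are stored as dictionaries (the sample values and the helper name `drop_features` are hypothetical):

```python
def drop_features(rows, names_to_drop):
    """Return copies of the rows without the named features (columns)."""
    return [
        {name: value for name, value in row.items() if name not in names_to_drop}
        for row in rows
    ]


# Hypothetical car records, invented for illustration.
cars = [
    {"model": "Hatchback A", "year": 2004, "owner": "Owner 1", "miles": 180000},
    {"model": "Sedan B", "year": 2015, "owner": "Owner 2", "miles": 60000},
]

# The owner's name does not help decide whether a car should be crushed.
selected = drop_features(cars, {"owner"})
```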
Benefits of Feature Selection:

1. Reduces Overfitting: By using only the most relevant features, the model can generalize better to new data.

2. Improves Model Performance: Selecting the right features can improve the accuracy, precision, and recall of the model.

3. Decreases Computational Costs: A smaller number of features requires less computation and storage resources.

4. Improves Interpretability: By reducing the number of features, it is easier to understand and interpret the results of the model.

Importance of Feature Selection
Feature Selection Techniques

There are mainly two types of feature selection techniques:

1. Supervised Feature Selection technique

Supervised feature selection techniques consider the target variable. These methods are used for labeled data, and they identify the relevant features in order to increase the efficiency of supervised models, such as classification and regression models.

2. Unsupervised Feature Selection technique

Unsupervised feature selection techniques ignore the target variable and can be used for unlabeled datasets.
There can be various reasons to perform feature selection:

• Simplification of the model.

• Less computational time.

• Avoiding the curse of dimensionality.

• Improved compatibility of the data with the model.

Supervised feature selection techniques can be roughly divided into three groups:

1. Wrapper Method:-
In the wrapper methodology, feature selection is treated as a search problem, in
which different combinations of features are built, evaluated, and compared with other combinations.
The algorithm is trained iteratively on a subset of features. On the basis of the model's output,
features are added or removed, and the model is trained again with the new feature set.

2. Filter Method:-
In the filter method, features are selected on the basis of statistical measures. This method does not
depend on the learning algorithm and chooses the features as a pre-processing step. It filters out
irrelevant features and redundant columns by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and
are less prone to overfitting the data.

3. Embedded Method:-
Embedded methods combine the advantages of both filter and wrapper methods by considering the
interaction of features while keeping the computational cost low. They are fast, like
the filter method, but more accurate. These methods are also iterative:
they evaluate each training iteration and find the features that contribute the
most to training in that iteration.
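Of the three approaches above, the filter method is the simplest to sketch: score every feature against the target with a statistic, rank the scores, and keep the top ones before any model is trained. The example below is one possible version, assuming a numerical target and Pearson's correlation as the ranking metric. The data and the names `pearson` and `filter_select` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def filter_select(columns, target, k):
    """Rank features by |correlation with the target| and keep the top k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical housing data, invented for illustration.
columns = {
    "square_feet": [900, 1200, 1500, 1800, 2100],
    "bedrooms": [2, 2, 3, 3, 4],
    "house_number": [17, 4, 52, 9, 23],
}
price = [150, 200, 250, 300, 350]  # hypothetical prices, in thousands

selected, scores = filter_select(columns, price, k=2)
```

With this data the house number scores far lower than the other two columns, so it is filtered out. No learning algorithm is involved, which is why the filter method is cheap; the trade-off is that each feature is scored in isolation.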
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature selection method will
work properly for their model. The better we know the data types of the variables, the easier it is to
choose the appropriate statistical measure for feature selection.
To do this, we first need to identify the types of the input and output variables. In machine learning, variables are mainly of two types:

◦ Numerical Variables: Variables with numeric values, such as integer or float.

◦ Categorical Variables: Variables with categorical values, such as Boolean, ordinal, or nominal.

Below are some univariate statistical measures that can be used for filter-based feature selection:

1. Numerical Input, Numerical Output:

Numerical input with numerical output is the case for regression predictive modelling. The common method for such a case is the correlation coefficient.

◦ Pearson's correlation coefficient (for linear correlation).

◦ Spearman's rank coefficient (for non-linear, monotonic correlation).

2. Numerical Input, Categorical Output:
Numerical input with categorical output is the case for classification predictive modelling problems. In this
case, correlation-based techniques should also be used, but they must account for the categorical output.
◦ ANOVA correlation coefficient (linear).
◦ Kendall's rank coefficient (nonlinear); it assumes that the categorical variable is ordinal.
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input. It is a different kind of
regression problem. We can use the same measures as in the above case, with the roles of input and output reversed.
4. Categorical Input, Categorical Output:
This is the case of classification predictive modelling with categorical input variables.
The most commonly used technique for such a case is the Chi-Squared test. We can also use information gain
(mutual information) in this case.
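For the numerical-input, numerical-output case, the difference between the two correlation coefficients can be seen in a short sketch. It assumes Spearman's coefficient is computed as Pearson's correlation of the ranks, with tied values given their average rank. The helper names and the cubic test data are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank coefficient: Pearson's correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# A monotonic but non-linear relationship: y grows as the cube of x.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

linear_score = pearson(x, y)   # below 1, because the relationship is not a straight line
rank_score = spearman(x, y)    # 1 (up to rounding), because the ordering is preserved
```

Spearman's coefficient captures any monotonic relationship, while Pearson's coefficient only rewards a straight-line one.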
We can summarise the above cases with appropriate measures in the table below:

Input Variable   Output Variable   Feature Selection Technique

Numerical        Numerical         ◦ Pearson's correlation coefficient (for linear correlation).
                                   ◦ Spearman's rank coefficient (for non-linear correlation).

Numerical        Categorical       ◦ ANOVA correlation coefficient (linear).
                                   ◦ Kendall's rank coefficient (nonlinear).

Categorical      Numerical         ◦ ANOVA correlation coefficient (linear).
                                   ◦ Kendall's rank coefficient (nonlinear).

Categorical      Categorical       ◦ Chi-Squared test (contingency tables).
                                   ◦ Mutual Information.
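For the categorical-input, categorical-output row of the table, the Chi-Squared statistic can be computed directly from the counts in a contingency table. The sketch below uses invented data and a hypothetical helper name, `chi_squared`. It returns only the statistic; a full test would also compare it against the chi-squared distribution with (rows - 1) × (columns - 1) degrees of freedom.

```python
from collections import Counter


def chi_squared(xs, ys):
    """Pearson's chi-squared statistic for two categorical sequences."""
    n = len(xs)
    observed = Counter(zip(xs, ys))   # cell counts of the contingency table
    x_totals = Counter(xs)            # row totals
    y_totals = Counter(ys)            # column totals
    statistic = 0.0
    for x_value, x_count in x_totals.items():
        for y_value, y_count in y_totals.items():
            # Count expected in this cell if the two variables were independent.
            expected = x_count * y_count / n
            statistic += (observed[(x_value, y_value)] - expected) ** 2 / expected
    return statistic


# Hypothetical data: did a customer buy, and two candidate categorical features.
bought = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
member = ["y", "y", "n", "n", "y", "n", "y", "n"]                          # tracks the target exactly
colour = ["red", "blue", "red", "blue", "blue", "red", "red", "blue"]      # unrelated to the target

member_score = chi_squared(member, bought)
colour_score = chi_squared(colour, bought)
```

A larger statistic indicates a stronger association with the target, so in this toy data `member` would be kept and `colour` dropped.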
Sequential Feature Selection Algorithms
• Sequential feature selection algorithms are part of the wrapper methods: they add or remove features from the dataset sequentially. A simpler variant evaluates each feature separately and selects M features from N features on the basis of their individual scores; this method is called naive sequential feature selection. It rarely works well because it does not account for feature dependence.

• Mathematically, these algorithms reduce the initial N features to M features, where M < N, and the M features are chosen to optimize the performance of the model.

• The search adds a feature candidate to, or removes it from, the candidate subset while evaluating the objective function or criterion. Sequential searches move in only one direction: they either increase or reduce the number of features in the candidate feature subset.
On the basis of the direction of movement, we can divide them into two variants:

• Sequential Forward Selection (SFS)

• Sequential Backward Selection (SBS)
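Sequential forward selection can be sketched as a greedy loop: start from an empty subset and repeatedly add the feature that most improves the criterion until M features are chosen. In the sketch below the criterion is a made-up `toy_score` function that stands in for a real objective, such as a cross-validated model score. The feature names and values are hypothetical.

```python
def sequential_forward_selection(features, score, m):
    """Greedily grow a feature subset until it holds m features."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < m:
        # Try each remaining candidate together with the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for the objective function: each feature has an
# individual value, but two features that carry the same information
# (miles and kilometres) add almost nothing when used together.
individual_value = {"miles": 0.6, "odometer_km": 0.59, "year": 0.3, "owner": 0.0}
redundant_pairs = [frozenset(["miles", "odometer_km"])]


def toy_score(subset):
    total = sum(individual_value[f] for f in subset)
    for pair in redundant_pairs:
        if pair <= set(subset):
            total -= 0.55  # penalty for redundant information
    return total


sfs_choice = sequential_forward_selection(individual_value, toy_score, m=2)

# Naive selection: take the top M features by individual score only.
naive_choice = sorted(individual_value, key=individual_value.get, reverse=True)[:2]
```

Because each candidate is scored together with the features already chosen, the forward search skips the redundant odometer column, while the naive ranking picks both redundant columns. Sequential backward selection would run the same loop in the other direction, starting from all features and removing one at a time.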

Bidirectional Feature Selection (BFS)
