FEATURE ENGINEERING
UNIT III
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an individual
measurable property or characteristic of a data point that is used as an input for a machine
learning algorithm. Features can be numerical, categorical, or text-based, and they represent different
aspects of the data that are relevant to the problem at hand. For example:
• In a dataset of housing prices, features could include the number of bedrooms, the square footage, the
location, and the age of the property.
• In a dataset of customer demographics, features could include age, gender, income level, and
occupation.
• The choice and quality of features are critical in machine learning, as they can greatly affect the
accuracy and performance of the model.
• In other words, every machine learning algorithm takes input data to generate its output. The input
data is usually in tabular form, consisting of rows (instances or observations) and columns (variables
or attributes), and these attributes are often called features.
• For example, in computer vision an image is an instance, and a line in the image could be a feature.
Similarly, in NLP a document can be an observation, and the word count could be a feature. So, a
feature is an attribute that affects a problem or is useful for solving it.
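The tabular view described above can be sketched in a few lines of Python. The housing records and column names below are hypothetical and exist only to show how rows (instances) and columns (features) map onto model input and output.

```python
# Hypothetical housing data: each dict is one row (instance), each key is one column.
rows = [
    {"bedrooms": 3, "sqft": 1400, "age_years": 20, "price": 250_000},
    {"bedrooms": 2, "sqft": 900, "age_years": 35, "price": 160_000},
    {"bedrooms": 4, "sqft": 2100, "age_years": 5, "price": 410_000},
]

target_name = "price"  # the output the model should predict
feature_names = [name for name in rows[0] if name != target_name]

# X holds the feature values (model input); y holds the target values (model output).
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(feature_names)
print(X)
print(y)
```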
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that are suitable for
machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and efficient
machine learning models. It is a machine learning technique that leverages data to create new
variables that aren’t in the training set. It can produce new features for both supervised and
unsupervised learning, with the goal of simplifying and speeding up data transformations while
also enhancing model accuracy.
Regardless of the data or architecture, a poor feature will directly harm your
model. Feature engineering, in simple terms, is the act of converting raw observations into
desired features using statistical or machine learning approaches. It helps to represent the
underlying problem to predictive models in a better way, which in turn improves the
accuracy of the model on unseen data. A predictive model contains predictor variables and
an outcome variable, and the feature engineering process selects the most useful predictor
variables for the model.
Feature engineering in ML mainly consists of four processes: Feature Creation, Transformations,
Feature Extraction, and Feature Selection. These processes are described below:
• Feature Creation: Feature creation is finding or constructing the most useful variables to be used in a
predictive model. The process is subjective, and it requires human creativity and intervention. New features
are created by combining existing features through operations such as addition, subtraction, and ratios, which
gives great flexibility in what the features can express.
• Transformations: The transformation step of feature engineering involves adjusting the predictor
variables to improve the accuracy and performance of the model. For example, it lets the model accept
a wider variety of input data and puts all the variables on the same scale, which makes the model
easier to understand. It also keeps all the features within an acceptable range to avoid computational
errors.
• Feature Extraction: Feature extraction is an automated feature engineering process that generates new
variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so
that it can be easily used and managed for data modelling. Feature extraction methods include cluster analysis,
text analytics, edge detection algorithms, and principal components analysis (PCA).
• Feature Selection: While developing a machine learning model, only a few variables in the dataset are
useful for building the model, and the remaining features are either redundant or irrelevant. If we feed the
model all of these redundant and irrelevant features, they may reduce its overall performance and accuracy.
Hence it is very important to identify and select the most appropriate features from the data and to remove
the irrelevant or less important ones, which is done with the help of feature selection in machine learning.
"Feature selection is a way of selecting the subset of the most relevant features from the original feature set
by removing the redundant, irrelevant, or noisy features."
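As a minimal sketch of the first two processes, the snippet below creates a new feature as a ratio of two existing ones and then rescales a feature with min-max scaling. The records, the `sqft_per_bedroom` feature, and the `min_max_scale` helper are hypothetical illustrations; min-max scaling is only one of several possible transformations.

```python
# Hypothetical raw housing records.
houses = [
    {"sqft": 1400, "bedrooms": 3},
    {"sqft": 900, "bedrooms": 2},
    {"sqft": 2100, "bedrooms": 4},
]

# Feature creation: derive a new feature as a ratio of two existing ones.
for house in houses:
    house["sqft_per_bedroom"] = house["sqft"] / house["bedrooms"]


def min_max_scale(values):
    """Transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# After scaling, both features share the same [0, 1] range.
sqft_scaled = min_max_scale([h["sqft"] for h in houses])
bedrooms_scaled = min_max_scale([h["bedrooms"] for h in houses])

print(houses)
print(sqft_scaled)
print(bedrooms_scaled)
```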
Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to be
used in a machine-learning model. It is an important step in the feature engineering process as it
can have a significant impact on the model’s performance.
Selecting the best features helps the model to perform well. For example, suppose we want to create a
model that automatically decides which cars should be crushed for spare parts, and we have a dataset
for this purpose. The dataset contains the model of the car, the year, the owner's name, and the miles
driven. The owner's name does not contribute to the model's performance, because it does not help
decide whether the car should be crushed, so we can remove this column and select the rest of the
features (columns) for model building.
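The car example can be sketched as follows. The records are invented for illustration, and the decision that `owner` is irrelevant is made by hand here, as in the text.

```python
# Hypothetical car records with the four columns named in the example.
cars = [
    {"model": "Sedan A", "year": 2004, "owner": "R. Kumar", "miles": 182_000},
    {"model": "Hatch B", "year": 2015, "owner": "S. Patel", "miles": 41_000},
    {"model": "Truck C", "year": 1998, "owner": "A. Singh", "miles": 260_000},
]

# Columns judged irrelevant to the crush decision (a manual choice in this sketch).
irrelevant = {"owner"}

# Keep every column except the irrelevant ones.
selected = [
    {name: value for name, value in car.items() if name not in irrelevant}
    for car in cars
]

print(selected[0])
```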
Benefits of Feature Selection:
1. Reduces Overfitting: By using only the most relevant features, the model can generalize better to new
data.
2. Improves Model Performance: Selecting the right features can improve the accuracy, precision, and
recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less computation and storage
resources.
4. Improves Interpretability: By reducing the number of features, it is easier to understand and interpret
the results of the model.
Importance of Feature Selection
There can be various reasons to perform feature selection:
• Simplification of the model.
• Less computational time.
• To avoid the curse of dimensionality.
• Improve the compatibility of data with models.
Feature Selection Techniques
There are mainly two types of feature selection techniques:
1. Supervised Feature Selection technique
Supervised feature selection techniques consider the target variable. These methods are used for
labeled data and identify the features that are relevant for increasing the efficiency of
supervised models, such as classification and regression models.
2. Unsupervised Feature Selection technique
Unsupervised feature selection techniques ignore the target variable and can be used for unlabeled
datasets.
There are mainly three techniques under supervised feature selection:
1. Wrapper Method:
In the wrapper methodology, feature selection is treated as a search problem in which different
combinations of features are made, evaluated, and compared with other combinations. The
algorithm is trained iteratively on subsets of features. On the basis of the model's output,
features are added or removed, and the model is trained again with the new feature set.
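A minimal sketch of the wrapper idea follows. It assumes a tiny toy dataset, a 1-nearest-neighbour classifier as the model, and leave-one-out accuracy as the evaluation; none of these choices come from the text above. The search here is exhaustive, which is only practical for a handful of features.

```python
from itertools import combinations

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 1.0, 0.9],
    [0.9, 3.0, 0.1],
    [3.0, 4.8, 0.8],
    [3.2, 1.1, 0.2],
    [2.9, 3.1, 0.7],
]
y = [0, 0, 0, 1, 1, 1]


def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on a feature subset."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for j in range(len(X)):
            if i == j:
                continue  # the held-out point may not vote for itself
            dist = sum((X[i][f] - X[j][f]) ** 2 for f in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def exhaustive_wrapper(X, y):
    """Evaluate every non-empty feature subset and return the best-scoring one."""
    n_features = len(X[0])
    best_subset, best_score = None, -1.0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            score = loo_accuracy(X, y, subset)
            if score > best_score:  # strict ">" keeps the smaller subset on ties
                best_subset, best_score = subset, score
    return best_subset, best_score


best_subset, best_score = exhaustive_wrapper(X, y)
print(best_subset, best_score)
```

Because the model is retrained (here, re-evaluated) for every candidate subset, the cost grows quickly with the number of features, which is why greedy searches are usually preferred in practice.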
2. Filter Method:
In the filter method, features are selected on the basis of statistical measures. This method does not
depend on the learning algorithm and chooses the features as a pre-processing step. The filter method
filters out irrelevant features and redundant columns by ranking them with different metrics. The
advantages of filter methods are that they need little computational time and are less prone to
overfitting the data.
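A filter can be sketched as "score each feature, rank, keep the top k". The example below assumes the absolute Pearson correlation with the target as the statistical measure; the data, the feature names, and the choice of k = 2 are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat it as 0
    return cov / sqrt(var_x * var_y)


# Hypothetical feature columns and target (price in thousands).
features = {
    "sqft": [900, 1400, 1600, 2100, 2500],
    "age_years": [35, 20, 18, 5, 2],
    "house_number": [12, 7, 44, 3, 21],
}
price = [160, 250, 280, 410, 470]

# Score every feature independently of any learning algorithm, then rank.
scores = {name: abs(pearson(column, price)) for name, column in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:2]

print(scores)
print(top_k)
```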
3. Embedded Method:
Embedded methods combine the advantages of both filter and wrapper methods by considering the
interaction of features while keeping the computational cost low. They are fast, like filter methods,
but generally more accurate. These methods are also iterative: feature selection happens during
model training, where each iteration is evaluated and the features that contribute the most to
training in that iteration are retained.
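One widely used embedded technique, not named in the text above, is L1-regularised linear regression (Lasso), where training itself drives the weights of unhelpful features to exactly zero. The sketch below is a simplified coordinate-descent version under stated assumptions: the features and target are already centred, there is no intercept, and the data and the penalty `alpha = 0.5` are hypothetical.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 one weight at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Prediction from all features except j, so rho measures what j can still explain.
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Hypothetical centred data: y depends on feature 0 (about 2 * x0), feature 1 is noise.
X = [[-2.5, 1.0], [-1.5, -1.0], [-0.5, 0.0], [0.5, 0.0], [1.5, -1.0], [2.5, 1.0]]
y = [-4.8, -3.1, -1.0, 1.1, 2.7, 5.1]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]

print(weights)
print(selected)
```

The selection is a by-product of fitting the model: features whose weight ends at zero are dropped, and no separate search over subsets is needed.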
How to choose a Feature Selection Method?
For machine learning engineers, it is very important to understand which feature selection method will
work well for their model. The better we know the data types of the variables, the easier it is to choose
an appropriate statistical measure for feature selection.
To do this, we first need to identify the types of the input and output variables. In machine learning,
variables are mainly of two types:
◦ Numerical Variables: Variables with numeric values, such as integers or floats.
◦ Categorical Variables: Variables with categorical values, such as Boolean, ordinal, or nominal values.
Below are some univariate statistical measures, which can be used for filter-based feature selection:
1. Numerical Input, Numerical Output:
This is the case for regression predictive modelling with numerical input variables. The common measure
for such a case is the correlation coefficient.
◦ Pearson's correlation coefficient (for linear correlation).
◦ Spearman's rank coefficient (for monotonic, possibly non-linear, correlation).
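The difference between the two coefficients can be seen on a relationship that is monotonic but not linear. The sketch below assumes a hypothetical feature `x` and a target `y = x ** 3`, and computes Spearman's coefficient as the Pearson correlation of the ranks (ties receive their average rank).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank coefficient: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but clearly non-linear

print(pearson(x, y))   # below 1, because the relationship is not a straight line
print(spearman(x, y))  # the ranks agree perfectly
```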
2. Numerical Input, Categorical Output:
Numerical input with categorical output is the case for classification predictive modelling problems. Here,
too, correlation-based techniques can be used, but they must account for the categorical output.
◦ ANOVA F-statistic, often called the ANOVA correlation coefficient (linear).
◦ Kendall's rank coefficient (nonlinear; it assumes the categorical variable is ordinal).
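For a numerical feature and a categorical target, the one-way ANOVA F-statistic compares the spread between class means with the spread inside each class. The sketch below uses hypothetical data and assumes every class has some within-class variation, since otherwise the denominator would be zero.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a numeric feature grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Variation of the class means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the values around their own class mean.
    ss_within = sum(
        (value - sum(g) / len(g)) ** 2 for g in groups.values() for value in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


labels = ["a", "a", "a", "b", "b", "b"]
informative = [1, 2, 3, 7, 8, 9]      # class means differ strongly
uninformative = [1, 5, 9, 2, 5, 11]   # class means are almost equal

f_informative = anova_f(informative, labels)
f_uninformative = anova_f(uninformative, labels)

print(f_informative)
print(f_uninformative)
```

A larger F value suggests the feature separates the classes better, so features can be ranked by it in a filter-style selection.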
3. Categorical Input, Numerical Output:
This is the case of regression predictive modelling with categorical input. It is a different kind of
regression problem. We can use the same measures as in the previous case, with the roles of input and
output reversed.
4. Categorical Input, Categorical Output:
This is the case of classification predictive modelling with categorical input variables.
The commonly used technique for such a case is the Chi-Squared test. We can also use information gain
(mutual information) in this case.
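The Chi-Squared statistic can be sketched directly from a contingency table of observed counts. The feature (`fuel`) and target (`crushed`) values below are hypothetical; the code computes only the statistic, not the p-value.

```python
from collections import Counter


def chi_squared(xs, ys):
    """Chi-squared statistic for the contingency table of two categorical lists."""
    n = len(xs)
    observed = Counter(zip(xs, ys))  # joint counts; missing pairs count as 0
    x_totals = Counter(xs)
    y_totals = Counter(ys)

    statistic = 0.0
    for x_value, x_count in x_totals.items():
        for y_value, y_count in y_totals.items():
            # Count expected in this cell if the two variables were independent.
            expected = x_count * y_count / n
            statistic += (observed[(x_value, y_value)] - expected) ** 2 / expected
    return statistic


crushed = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]

# A feature that matches the target perfectly, and one that is independent of it.
fuel_dependent = ["petrol"] * 4 + ["diesel"] * 4
fuel_independent = ["petrol", "petrol", "diesel", "diesel"] * 2

print(chi_squared(fuel_dependent, crushed))
print(chi_squared(fuel_independent, crushed))
```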
We can summarise the above cases with appropriate measures in the table below:

Input Variable   Output Variable   Feature Selection Technique
Numerical        Numerical         ◦ Pearson's correlation coefficient (for linear correlation).
                                   ◦ Spearman's rank coefficient (for non-linear correlation).
Numerical        Categorical       ◦ ANOVA F-statistic (linear).
                                   ◦ Kendall's rank coefficient (nonlinear).
Categorical      Numerical         ◦ ANOVA F-statistic (linear).
                                   ◦ Kendall's rank coefficient (nonlinear).
Categorical      Categorical       ◦ Chi-Squared test (contingency tables).
                                   ◦ Mutual Information.
Sequential Feature Selection Algorithms
• Sequential feature selection algorithms are part of the wrapper methods: they add or remove features
from the dataset sequentially. A simpler variant evaluates each feature separately and selects M features
from N features on the basis of individual scores; this is called naive sequential feature selection. It
rarely works well because it does not account for dependence between features.
• Mathematically, these algorithms reduce the initial N features to M features, where M < N, and the
M features are chosen to optimize the performance of the model.
• The search adds or removes a candidate feature from the candidate subset while evaluating the objective
function or criterion. Sequential searches move in only one direction: they either increase or reduce the
number of features in the candidate feature subset.
On the basis of the direction of movement, we can divide them into two variants:
• Sequential Forward Selection (SFS)
• Sequential Backward Selection (SBS)
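Both variants can be sketched as greedy searches around a criterion function. The sketch below assumes leave-one-out accuracy of a 1-nearest-neighbour classifier as the criterion and a small hypothetical dataset with N = 3 features; the function names are illustrative, not a library API.

```python
def loo_accuracy(X, y, subset):
    """Criterion: leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for j in range(len(X)):
            if i == j:
                continue
            dist = sum((X[i][f] - X[j][f]) ** 2 for f in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def sequential_forward_selection(X, y, m):
    """Start with no features and repeatedly add the one that helps the criterion most."""
    selected, remaining = [], list(range(len(X[0])))
    while len(selected) < m:
        best = max(remaining, key=lambda f: loo_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def sequential_backward_selection(X, y, m):
    """Start with all features and repeatedly drop the one whose removal hurts least."""
    selected = list(range(len(X[0])))
    while len(selected) > m:
        drop = max(
            selected,
            key=lambda f: loo_accuracy(X, y, [g for g in selected if g != f]),
        )
        selected.remove(drop)
    return selected


# Hypothetical data: feature 0 is informative, feature 1 is large-scale noise,
# feature 2 is weakly informative.
X = [
    [1, 50, 0],
    [2, 10, 1],
    [1, 30, 5],
    [8, 51, 6],
    [9, 11, 9],
    [8, 31, 10],
]
y = [0, 0, 0, 1, 1, 1]

print(sequential_forward_selection(X, y, 2))
print(sequential_backward_selection(X, y, 2))
```

Unlike the naive approach, each step scores a candidate together with the features already in the subset, so dependence between features influences the choice.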
Sequential Forward Selection (SFS)
Sequential Backward Selection (SBS)
Bidirectional Feature Selection (BFS)