Basics of Feature Engineering
Feature engineering is a critical preparatory process in machine learning.
It takes raw input data and converts it into well-aligned features that are ready to be used by machine learning
models.
Unstructured data is raw, unorganized data which doesn’t follow a specific format or hierarchy.
Typical examples of unstructured data include text data from social networks such as Twitter and Facebook, and
data from server logs.
What is a feature?
• A feature is an attribute of a data set that is used in a machine learning process.
• The features in a data set are also called its dimensions.
• So a data set having ‘n’ features is called an n-dimensional data set.
For example, consider the famous machine learning data set Iris, introduced by the British statistician
and biologist Ronald Fisher. It has five attributes or features, namely
Sepal.Length, Sepal.Width, Petal.Length,
Petal.Width and Species.
Out of these, the feature ‘Species’ represents the class variable and the remaining
features are the predictor variables. It is a five-dimensional data set.
FIG. 4.1 Data set features
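As a small illustration of these terms, the sketch below stores one Iris observation as a Python dictionary and separates the class variable from the predictor variables. The column names follow the copy of Iris that ships with R, and the variable names (`row`, `predictors`, `n_dimensions`) are chosen here for illustration only.

```python
# One observation from Iris (the first row of the classic data set).
row = {
    "Sepal.Length": 5.1,
    "Sepal.Width": 3.5,
    "Petal.Length": 1.4,
    "Petal.Width": 0.2,
    "Species": "setosa",
}

class_variable = "Species"
# Every attribute other than the class variable is a predictor variable.
predictors = [name for name in row if name != class_variable]

# Each attribute counts as one dimension of the data set.
n_dimensions = len(row)
print(n_dimensions, predictors)
```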
What is feature engineering?
Feature engineering refers to the process of translating a data set into features such that these features are able to
represent the data set more effectively and result in a better learning performance.
Feature engineering is an important pre-processing step for machine learning. It has two major elements:
• feature transformation
• feature subset selection
Feature transformation transforms the data, whether structured or unstructured, into a new set of features which can
represent the underlying problem that machine learning is trying to solve. There are two variants of feature
transformation:
• feature construction
• feature extraction
Both are sometimes known as feature discovery.
The feature construction process discovers missing information about the relationships between features and augments
the feature space by creating additional features.
Hence, if there are ‘n’ features or dimensions in a data set, ‘m’ more features or dimensions may get added after
feature construction.
So, in the end, the data set becomes ‘n + m’ dimensional.
Feature extraction is the process of extracting or creating a new set of features from the original set of features
using some functional mapping.
Unlike feature transformation, feature subset selection (or simply feature selection) does not generate any new
feature.
The objective of feature selection is to derive a subset of features from the full feature set which is most
meaningful in the context of a specific machine learning problem.
So, essentially the job of feature selection is to derive a subset Fj (F1, F2, …, Fm) of Fi (F1, F2, …, Fn), where m < n,
such that Fj is most meaningful and gets the best result for a machine learning problem.
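The mechanics of keeping a subset of features can be sketched as follows. This only shows the projection onto a chosen subset; deciding which subset is "most meaningful" is the actual feature selection problem and is not addressed here. The helper `select_features` and the sample values are hypothetical.

```python
def select_features(rows, feature_names, keep):
    """Return the rows restricted to the features listed in `keep`."""
    indices = [feature_names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


# Hypothetical data set with n = 4 features.
feature_names = ["F1", "F2", "F3", "F4"]
rows = [
    [1.0, 10.0, 0.5, 7.0],
    [2.0, 20.0, 0.7, 9.0],
]

# Keep a subset of m = 2 features (m < n); no new feature is created.
subset = select_features(rows, feature_names, keep=["F1", "F3"])
print(subset)
```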
4.2 FEATURE TRANSFORMATION
Feature transformation is used as an effective tool for dimensionality reduction and hence for boosting learning
model performance. Broadly, there are two distinct goals of feature transformation:
• Achieving the best reconstruction of the original features in the data set
• Achieving the highest efficiency in the learning task
4.2.1 Feature construction
Feature construction involves transforming a given set of input features to generate a new set of more
powerful features.
Let’s take the example of a real estate data set having details of all apartments sold in a specific region.
The data set has three features – apartment length, apartment breadth, and price of the apartment.
If it is used as an input to a regression problem, such data can be training data for the regression model.
So given the training data, the model should be able to predict the price of an apartment whose price is not
known or which has just come up for sale.
However, instead of using the length and breadth of the apartment as predictors, it is much more convenient and
makes more sense to use the area of the apartment, which is not an existing feature of the data set.
So such a feature, namely apartment area, can be added to the data set.
In other words, we transform the three-dimensional data set to a four-dimensional data set, with the newly
‘discovered’ feature apartment area being added to the original data set.
FIG. 4.2 Feature construction
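The apartment example can be sketched in a few lines of Python. The records and the helper name `add_area` are hypothetical; the only point illustrated is that a new feature is computed from existing ones and added to the data set.

```python
def add_area(records):
    """Return copies of the records with a constructed 'area' feature."""
    return [
        {**record, "area": record["length"] * record["breadth"]}
        for record in records
    ]


# Hypothetical three-dimensional data set: length, breadth, price.
apartments = [
    {"length": 80.0, "breadth": 50.0, "price": 2_300_000},
    {"length": 60.0, "breadth": 40.0, "price": 1_500_000},
]

augmented = add_area(apartments)
# Each record now has four features instead of three.
print(augmented[0])
```

The original records are left untouched, so the constructed feature can be compared against the raw ones before deciding which to feed into the regression model.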
There are certain situations where feature construction is an essential activity before starting with the machine
learning task. These situations are
• when features have categorical values and machine learning needs numeric inputs
• when features have numeric (continuous) values that need to be converted to ordinal values
• when text-specific feature construction needs to be done
Encoding categorical (nominal) variables
Let’s take the example of another data set on athletes, as presented in Figure 4.3a.
Say the data set has the features age, city of origin, parents athlete (i.e. whether either of the parents
was an athlete) and chance of win.
The feature chance of win is the class variable while the others are predictor variables.
Machine learning algorithms, whether classification algorithms (like kNN) or regression algorithms, generally
require numerical values to learn from. So the three features city of origin, parents athlete, and chance of win,
which are categorical in nature, cannot be used directly by such algorithms and first have to be encoded as
numbers.
FIG. 4.3 Feature construction (encoding nominal variables)
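One common way to encode a nominal feature is to replace it with one 0/1 column per distinct category (often called one-hot or dummy encoding). The sketch below assumes this scheme; the helper `one_hot`, the column naming and the sample records are hypothetical and are not taken from Figure 4.3.

```python
def one_hot(records, feature):
    """Replace one nominal feature with a 0/1 column per distinct category."""
    categories = sorted({record[feature] for record in records})
    encoded = []
    for record in records:
        # Copy every feature except the one being encoded.
        row = {k: v for k, v in record.items() if k != feature}
        for category in categories:
            row[f"{feature}={category}"] = 1 if record[feature] == category else 0
        encoded.append(row)
    return encoded


# Hypothetical athlete records.
athletes = [
    {"age": 18, "city_of_origin": "City A"},
    {"age": 20, "city_of_origin": "City B"},
    {"age": 23, "city_of_origin": "City A"},
]

encoded = one_hot(athletes, "city_of_origin")
print(encoded[0])
```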
FIG. 4.4 Feature construction (encoding ordinal variables)
Transforming numeric (continuous) features to categorical features
FIG. 4.5 Feature construction (numeric to categorical)
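A numeric feature is typically turned into a categorical one by binning: the value range is cut at chosen boundaries and each interval gets a label. The sketch below assumes apartment prices with made-up boundaries and labels; `to_category` is a hypothetical helper, not the scheme shown in Figure 4.5.

```python
from bisect import bisect_right


def to_category(value, boundaries, labels):
    """Map a numeric value to the label of the interval it falls into.

    `boundaries` must be sorted, and `labels` needs one more entry than
    `boundaries`. A value equal to a boundary goes to the upper interval.
    """
    return labels[bisect_right(boundaries, value)]


# Hypothetical cut points and labels for apartment prices.
boundaries = [1_000_000, 3_000_000]
labels = ["Low", "Medium", "High"]

prices = [500_000, 1_000_000, 3_500_000]
categories = [to_category(p, boundaries, labels) for p in prices]
print(categories)
```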
Text-specific feature construction
In the current world, text is arguably the most predominant medium of communication.
Whether we think about social networks like Facebook, micro-blogging channels like Twitter, emails, or
messaging services such as WhatsApp, text plays a major role in the flow of information.
Hence, text mining is an important area of research – not only for technology practitioners but also for industry
practitioners.
However, making sense of text data, due to the inherent unstructured nature of the data, is not so
straightforward.
In the first place, unlike structured data sets, the text data chunks that we can think about do not have readily
available features on which machine learning tasks can be executed.
Machine learning models need numerical data as input. So the text data in the data sets needs to be
transformed into numerical features.
Text data, or corpus, which is the more popular term, is converted to a numerical representation by a
process known as vectorization. In this process, word occurrences in all documents belonging to the corpus are
consolidated in the form of a bag-of-words. There are three major steps that are followed:
• Tokenize
• Count
• Normalize
In order to tokenize a corpus, blank spaces and punctuation marks are used as delimiters to separate out the
words, or tokens.
Then the number of occurrences of each token is counted for each document.
Lastly, tokens are weighted so that those occurring in the majority of the documents are given reduced importance.
A matrix is then formed with each token representing a column and a specific document of the corpus
representing each row.
Each cell contains the count of occurrence of the token in a specific document. This matrix is known as a
document-term matrix (also known as a term-document matrix).
FIG. 4.6 Feature construction (text-specific)
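The three steps can be sketched with the Python standard library. Lower-casing the text and using inverse document frequency (here log(N / df) + 1) as the weighting are assumptions made for this sketch; the text above does not prescribe a particular normalization. The function names and the tiny corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split on blank spaces and punctuation; lower-casing is an assumption."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_term_matrix(corpus):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    counts = [Counter(tokenize(doc)) for doc in corpus]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[token] for token in vocabulary] for count in counts]
    return vocabulary, matrix


def idf_weights(matrix):
    """One weight per column; tokens found in most documents get lower weights."""
    n_docs = len(matrix)
    weights = []
    for column in zip(*matrix):
        doc_freq = sum(1 for cell in column if cell > 0)
        weights.append(math.log(n_docs / doc_freq) + 1.0)
    return weights


corpus = ["The cat sat.", "The dog sat down!", "The bird flew."]
vocabulary, matrix = document_term_matrix(corpus)
weights = idf_weights(matrix)

for token, weight in zip(vocabulary, weights):
    print(f"{token:5s} {weight:.3f}")
```

In this corpus the token "the" occurs in every document and therefore receives the lowest weight, while tokens that occur in a single document receive the highest.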
Feature extraction
In feature extraction, new features are created from a combination of original features.
Some of the commonly used operators for combining the original features include
• For Boolean features: Conjunctions, Disjunctions, Negation, etc.
• For nominal features: Cartesian product, M of N, etc.
• For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division, Average, Equivalence,
Inequality, etc.
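A few of these operators are sketched below on a single hypothetical record. The feature names are invented for illustration, and `m_of_n` is taken here to mean "at least M of the N given conditions are true", which is the usual reading of that operator.

```python
def m_of_n(flags, m):
    """True when at least m of the given Boolean values are true."""
    return sum(1 for flag in flags if flag) >= m


# Hypothetical record with Boolean and numerical features.
record = {
    "is_member": True,
    "has_coupon": False,
    "is_weekend": True,
    "length": 20.0,
    "breadth": 10.0,
}

flags = [record["is_member"], record["has_coupon"], record["is_weekend"]]

extracted = {
    # Boolean operators
    "member_and_coupon": record["is_member"] and record["has_coupon"],
    "member_or_coupon": record["is_member"] or record["has_coupon"],
    "not_weekend": not record["is_weekend"],
    "two_of_three": m_of_n(flags, 2),
    # Numerical operators
    "max_side": max(record["length"], record["breadth"]),
    "mean_side": (record["length"] + record["breadth"]) / 2,
}
print(extracted)
```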
After feature extraction using a mapping function f (F1, F2, …, Fn), say, we will have a new set of features
(F′1, F′2, …, F′m) such that each F′i is computed from the original features by the mapping f.
The most popular feature extraction algorithms used in machine learning are:
• Principal Component Analysis
• Singular Value Decomposition
Principal Component Analysis
Every data set has multiple attributes or dimensions, many of which might be similar to each other. For
example, the height and weight of a person.
In general, a machine learning algorithm performs better as the number of related attributes or features is
reduced.
In other words, a key to the success of machine learning is having features that are few in number and have
very little similarity with one another.
This is the main guiding philosophy of the principal component analysis (PCA) technique of feature extraction.
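The "similarity" between two numerical attributes is commonly measured by their correlation. The sketch below computes the Pearson correlation coefficient for made-up height and weight values; it is not PCA itself, only an illustration of the kind of redundancy that PCA is meant to remove. The function name and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical measurements for five people.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = pearson(heights_cm, weights_kg)
print(f"correlation between height and weight: {r:.3f}")
```

A value close to 1 indicates that the two attributes carry largely the same information, so replacing them with fewer, uncorrelated features loses little.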