Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) is the first step in analysing any kind of data. Rather than a
specific set of procedures, EDA is an approach, or a philosophy, which seeks to uncover the most important
and often hidden patterns in a data set. In EDA, we explore the data and try to come up with hypotheses
about it, which we can later test using hypothesis testing. Statisticians use it to take a bird’s eye view of the
data and try to make some sense of it.
In this module, the following topics are covered:
1. Univariate Analysis
• Unordered Categorical Variables
• Ordered Categorical Variables
• Quantitative Variables
• Quantitative Variables - Summary Metrics
2. Segmented Univariate Analysis
• Basis of Segmentation
• Quick way of Segmentation
• Comparison of Averages
• Comparison of Other Metrics
3. Bivariate Analysis
• Bivariate Analysis on Continuous Variables
• Business Problems involving Correlation
• Bivariate Analysis on Categorical Variables
4. Derived Metrics
• Types of Derived Metrics: Type Driven Metrics
• Types of Derived Metrics: Business Driven Metrics
• Types of Derived Metrics: Data Driven Metrics
Univariate Analysis
As the term “univariate” suggests, this session deals with analysing variables one at a time. It is
important to separately understand each variable before moving on to analysing multiple variables
together.
The agenda of univariate analysis is to understand:
• Metadata description
• Data distribution plots
• Summary metrics
Given a data set, the first step is to understand what it contains. Information about a data set can be gained
simply by looking at its metadata. Metadata, in simple terms, is the data that describes the data set and each
of its variables in detail. It captures information such as the size of the data set, how and when the data set
was created, what the rows and variables represent, etc.
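As a small illustration of reading basic metadata, the sketch below counts the rows and lists the variables of a data set. It uses only Python's standard library; the CSV content, column names, and the helper `describe_metadata` are made up for this example and are not part of the course material.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """student_id,gender,mother_education,science_pct
1,F,degree and above,78.5
2,M,high school,64.0
3,F,primary school,59.0
4,M,degree and above,81.0
"""

def describe_metadata(csv_text):
    """Return the number of rows and the column names of a CSV data set."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {"n_rows": len(rows), "n_columns": len(columns), "columns": columns}

metadata = describe_metadata(RAW_CSV)
print(metadata)
```

What each row and column means (for example, that `science_pct` is a percentage) still has to come from the documentation that accompanies the data set.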
Types of Variables
• Ordered categorical variables - These have a natural ordering among their categories. Some examples are:
Salary = High-Medium-Low
Month = Jan-Feb-Mar etc.
• Unordered categorical variables - These do not have the notion of high-low, more-less etc. Examples:
Type of loan taken by a person = home, personal, auto etc.
Organisation of a person = sales, marketing, HR etc.
• Quantitative variables - Apart from the two types of categorical variables, the other most common
type is quantitative variables. These are simply numeric variables which can be added up, multiplied,
divided etc. For example, salary, number of bank accounts, runs scored by a batsman, the mileage of a
car etc.
Distribution plots reveal interesting insights about the data. You can observe various visible patterns
in the plots and try to understand how they came to be.
Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be used
everywhere. Thus, it is important to understand the data and then choose what metric to use to
summarise the data.
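As a rough illustration of choosing a metric that suits the variable type, the sketch below summarises one variable of each kind using only Python's standard library. All sample values and variable names are invented for the example.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical sample values for each variable type.
loan_type = ["home", "personal", "auto", "home", "home", "personal"]  # unordered categorical
salary_band = ["Low", "Medium", "High", "Medium", "Medium", "Low"]    # ordered categorical
salary = [30_000, 42_000, 45_000, 47_000, 52_000, 400_000]            # quantitative

# Unordered categorical: counts and the most frequent category are meaningful.
loan_counts = Counter(loan_type)
loan_mode = mode(loan_type)

# Ordered categorical: counts listed in the natural order of the categories.
band_order = ["Low", "Medium", "High"]
band_counts = [(band, salary_band.count(band)) for band in band_order]

# Quantitative: the mean is pulled up by the single very large salary,
# while the median is not.
salary_mean = mean(salary)
salary_median = median(salary)

print(loan_counts, loan_mode)
print(band_counts)
print(salary_mean, salary_median)
```

In this made-up sample the mean salary is more than twice the median, which is one reason to look at the distribution before picking a summary metric.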
Segmented Univariate Analysis
The broad agenda of “Segmented Univariate Analysis” is as follows:
• Basis of segmentation
• Comparison of averages
• Comparison of other metrics
Basis of segmentation
The entire segmentation process can be divided into four parts:
• Take raw data
• Group by dimensions
• Summarise using a relevant metric such as mean, median, etc.
• Compare the aggregated metric across groups/categories
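The four steps above can be sketched as follows. This is a minimal standard-library version; the records, the segment names, and the helper `summarise_by_segment` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Step 1: raw data as (segment, value) pairs, e.g. mother's education and marks.
records = [
    ("degree and above", 82), ("degree and above", 76), ("degree and above", 90),
    ("high school", 65), ("high school", 71),
    ("primary school", 58), ("primary school", 62), ("primary school", 60),
]

def summarise_by_segment(rows, metric):
    """Group (segment, value) pairs by segment and apply `metric` to each group."""
    groups = defaultdict(list)
    for segment, value in rows:          # Step 2: group by the dimension
        groups[segment].append(value)
    # Step 3: summarise each group with the chosen metric
    return {segment: metric(values) for segment, values in groups.items()}

median_marks = summarise_by_segment(records, median)

# Step 4: compare the aggregated metric across segments, highest first.
ranking = sorted(median_marks.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Passing a different function as `metric` (for example `statistics.mean`) changes the summary without touching the grouping logic.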
Comparison of Averages
Don’t blindly believe in the averages of the buckets. You need to observe the distribution of each
bucket closely and ask yourself if the difference in means is significant enough to draw a conclusion.
If the difference in means is small, you may not be able to draw inferences. In such cases, a technique
called hypothesis testing is used to ascertain whether the difference in means is significant or due to
‘randomness’. Don’t worry if the concept of hypothesis testing is not yet clear; it is dealt with
separately in the hypothesis testing module.
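To see why the distribution matters, consider the sketch below, which compares two invented buckets using the mean, the median, and the standard deviation. The numbers are made up to show the effect; this is not a substitute for a hypothesis test.

```python
from statistics import mean, median, stdev

# Hypothetical marks in two buckets; bucket_a contains one extreme value (95).
bucket_a = [50, 52, 55, 58, 60, 61, 63, 95]
bucket_b = [54, 56, 57, 59, 60, 62, 63, 65]

def bucket_summary(values):
    """Summarise a bucket with its mean, median, and sample standard deviation."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": round(stdev(values), 2),
    }

summary_a = bucket_summary(bucket_a)
summary_b = bucket_summary(bucket_b)
print(summary_a)
print(summary_b)
```

Here bucket A has the higher mean, but only because of a single extreme value: its median is slightly lower than that of bucket B and its spread is much larger, so the difference in means alone would be a weak basis for a conclusion.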
Bivariate Analysis
• Bivariate analysis on continuous variables
Correlation is a metric that measures the linear relationship between two variables. It is a number
between -1 and 1 which quantifies the extent to which two variables ‘correlate’ with each other.
• If one increases as the other increases, the correlation is positive
• If one decreases as the other increases, the correlation is negative
• If the two show no linear tendency to move together, the correlation is close to zero
In general, a positive correlation means that two variables will increase together and decrease
together, e.g. an increase in rain is accompanied by an increase in humidity. A negative correlation
means that if one variable increases the other decreases, e.g. in some cases, as the price of a
commodity decreases its demand increases.
A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as
one variable moves, either up or down, the other one moves in the same direction in exact linear
proportion. A perfect negative correlation (a coefficient of exactly -1) means that the two variables
move in opposite directions in exact linear proportion, while a zero correlation implies no linear
relationship; the variables may still be related in a non-linear way.
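The sketch below computes the Pearson correlation coefficient, the most common correlation measure, directly from its definition. The function name `pearson` and the sample data are chosen for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The formula would divide by zero for a constant variable,
        # so that case is rejected here.
        raise ValueError("correlation cannot be computed for a constant variable")
    return cov / sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
r_up = pearson(x, [2 * v + 1 for v in x])   # points on an increasing straight line
r_down = pearson(x, [-3 * v for v in x])    # points on a decreasing straight line
r_curve = pearson(x, [v * v for v in x])    # y = x squared: related, but not linearly
print(r_up, r_down, r_curve)
```

The third case is worth noting: `y` is completely determined by `x`, yet the coefficient comes out as zero because the relationship is not linear.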
• Bivariate analysis on categorical variables
The categorical bivariate analysis is essentially an extension of the segmented univariate analysis to
another categorical variable. In segmented univariate analysis, you compare metrics such as ‘mean of X’
across various segments of a categorical variable, e.g. the mean marks of students are higher for ‘degree
and above’ than for other levels of the mother’s education; or the median income of educated parents is
higher than that of uneducated ones, etc.
In the categorical bivariate analysis, you extend this comparison to other categorical variables and
ask:
• Is this true for all categories of another variable, say, men and women?
• Taking another categorical variable, such as state: is the median income of educated parents higher
than that of uneducated ones in all states?
Thus, you are drilling down into another categorical variable and getting closer to the true patterns in
the data. In fact, you may also go to the next level and ask: is the median income of educated parents
higher than that of uneducated ones (variable 1) in all states (variable 2) for all age groups (variable 3)?
This is what you may call ‘trivariate analysis’, and though it gives you a more granular version of the
truth, it gets a bit complex to make sense of and explain to others (and hence it is not usually done in
EDA). Thus, remember that conducting only segmented univariate analysis may deceive you into
thinking that a certain phenomenon is true without asking the question: is it true for all
sub-populations, or is it true only when you aggregate information across the entire population?
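A minimal sketch of this drill-down is shown below. The records are invented so that a pattern which holds for the whole population fails in one sub-population; the state names, income values, and the helper `median_by` are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (state, parent_education, income in thousands).
rows = [
    ("State A", "educated", 60), ("State A", "educated", 70), ("State A", "educated", 65),
    ("State A", "uneducated", 40), ("State A", "uneducated", 45), ("State A", "uneducated", 42),
    ("State B", "educated", 50), ("State B", "educated", 48), ("State B", "educated", 52),
    ("State B", "uneducated", 55), ("State B", "uneducated", 53), ("State B", "uneducated", 58),
]

def median_by(records, key):
    """Median income for each group, where `key` maps a record to its group."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: median(values) for group, values in groups.items()}

# Segmented univariate view: education only.
overall = median_by(rows, key=lambda r: r[1])

# Bivariate view: education within each state.
by_state = median_by(rows, key=lambda r: (r[0], r[1]))

print(overall)
print(by_state)
```

In this made-up data the educated group has the higher median income overall and in State A, but not in State B, which is exactly the kind of detail the aggregated view hides.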
So in general, there are two fundamental aspects of analysing categorical variables:
1. To see the distribution of two categorical variables. For example, if you want to compare the
number of boys and girls who play games, you can make a ‘cross table’ as given below:
From this table, firstly, you can compare boys and girls across a fixed level of ‘play games’, e.g. a
higher number of boys play games every day than girls, a higher number of girls never play games
than boys, etc. And secondly, you can compare the levels of ‘play games’ across a fixed value of
gender, e.g. most boys play every day and very few play once a month or never.
2. To see the distribution of two categorical variables with one continuous variable. For
example, you saw how a student’s percentage in science is distributed based on the father’s
occupation (categorical variable 1) and the poverty level (categorical variable 2).
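For the first kind of analysis, a cross table can be built by counting how often each pair of categories occurs together. The sketch below uses invented survey responses and a hypothetical helper `cross_table`; the counts are not those of the course data set.

```python
from collections import Counter

# Hypothetical survey responses: (gender, how often the child plays games).
responses = [
    ("boy", "every day"), ("boy", "every day"), ("boy", "every day"),
    ("boy", "once a week"), ("boy", "never"),
    ("girl", "every day"), ("girl", "once a week"), ("girl", "once a week"),
    ("girl", "never"), ("girl", "never"),
]

def cross_table(pairs):
    """Count co-occurrences of two categorical variables as {row: {column: count}}."""
    counts = Counter(pairs)
    row_levels = sorted({row for row, _ in pairs})
    col_levels = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in col_levels} for row in row_levels}

table = cross_table(responses)
for gender, row in table.items():
    print(gender, row)
```

Reading across a row compares the levels of ‘play games’ for one gender; reading down a column compares boys and girls at a fixed level.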
Derived Metrics
There are three different types of derived metrics:
• Type-driven metrics
• Business-driven metrics
• Data-driven metrics
Type-driven metrics
These metrics can be derived by understanding the variable’s typology. You have already learnt one
simple way of classifying variables/attributes: categorical (ordered, unordered) and quantitative or
numeric. Similarly, there are various other ways of classification, one of which is Stevens’ typology.
Stevens’ typology classifies variables into four types: nominal, ordinal, interval and ratio.
• Nominal variables: Categorical variables, where the categories differ only by their
names; there is no order among categories, e.g. colour (red, blue, green), gender (male,
female), department (HR, analytics, sales). These are the most basic form of categorical
variables.
• Ordinal variables: Categories follow a certain order, but the mathematical difference
between categories is not meaningful, e.g. education level (primary school, high school,
college), height (high, medium, low), performance (bad, good, excellent), etc. Ordinal
variables have the properties of nominal variables as well.
• Interval variables: Values follow a certain order, and the mathematical difference
between values is meaningful, but the zero point is arbitrary, e.g. temperature in degrees
Celsius (the difference between 40 and 30 degrees C is meaningful), dates (the difference
between two dates is the number of days between them), etc. Interval variables have the
properties of both nominal and ordinal variables.
• Ratio variables: Apart from the mathematical difference, the ratio
(division/multiplication) is meaningful because the scale has a true zero, e.g. sales in
dollars ($100 is twice $50), marks of students (50 is half of 100), etc. Ratio variables have
the properties of nominal, ordinal and interval variables.
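The difference between interval and ratio variables can be checked with a small calculation. The sketch below, an illustration rather than part of the course material, compares two temperatures in Celsius (interval scale) and in Kelvin (a ratio scale, since it has a true zero).

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return celsius + 273.15

t_low_c, t_high_c = 20.0, 40.0

# Differences agree on both scales, so differences are meaningful in Celsius.
difference_c = t_high_c - t_low_c
difference_k = celsius_to_kelvin(t_high_c) - celsius_to_kelvin(t_low_c)

# Ratios do not agree: 40 C looks like "twice" 20 C only because of where
# the Celsius zero happens to sit.
naive_ratio_c = t_high_c / t_low_c
ratio_k = celsius_to_kelvin(t_high_c) / celsius_to_kelvin(t_low_c)

print(difference_c, round(difference_k, 2))
print(naive_ratio_c, round(ratio_k, 3))
```

The difference is 20 degrees on either scale, while the ratio changes from 2 in Celsius to roughly 1.07 in Kelvin, which is why ratios are only interpreted for ratio variables.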
Understanding the type of a variable enables you to derive new metrics of a different type from the same
column. For example, age in years is a ratio attribute, but you can convert it into an ordinal type by
binning it into categories such as children (< 13 years), teenagers (13-19 years), young adults (20-25
years), etc. This enables you to ask questions, e.g. do teenagers do X better than children, are young
adults more likely to do X than the other two types, etc. Here, X is an action you are interested in
finding.
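The binning described above can be sketched as a small function. The three named bins follow the text; the label used for ages above 25 is a placeholder, since the text does not name further bins, and the sample ages are invented.

```python
from collections import Counter

def age_group(age_years):
    """Map an age in whole years (ratio type) to an ordered category (ordinal type)."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years < 13:
        return "children"
    if age_years <= 19:
        return "teenagers"
    if age_years <= 25:
        return "young adults"
    # The text does not name any bins beyond 25, so a placeholder label is used.
    return "older than 25"

ages = [8, 12, 13, 17, 19, 20, 25, 26, 40]   # hypothetical sample
groups = [age_group(age) for age in ages]
group_counts = Counter(groups)
print(group_counts)
```

Once the derived column exists, it can be used as a segmentation basis like any other ordered categorical variable.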
Business-driven metrics
These are derived from the existing variables, but deriving them requires domain expertise. Deriving
metrics from the business perspective is not an easy task. Without understanding the domain correctly,
deriving insights becomes difficult and prone to errors.
Data-driven metrics
Data-driven metrics can be created based on the variables present in the existing data set. For
example, suppose you have two variables in your data set, "weight" and "height", which show a high
correlation. Instead of analysing "weight" and "height" separately, you can derive a new metric,
"Body Mass Index (BMI)", defined as weight in kilograms divided by the square of height in metres.
Once you have the BMI, you can easily categorise people based on their fitness, e.g. a BMI below 18.5
is considered underweight, while a BMI above 30.0 is considered obese, by standard norms. This is how
data-driven metrics can help you discover hidden patterns in the data.
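As a minimal sketch of deriving such a metric, the code below computes BMI as weight in kilograms divided by the square of height in metres and applies the two cut-offs mentioned in the text. The sample people and the label for the middle range are assumptions made for the example.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Apply the two cut-offs from the text; everything else gets a placeholder label."""
    if value < 18.5:
        return "underweight"
    if value > 30.0:
        return "obese"
    return "in between"   # the text gives no labels for this range

# Hypothetical (weight in kg, height in m) pairs.
people = [(50, 1.75), (70, 1.75), (100, 1.75)]
categories = [bmi_category(bmi(weight, height)) for weight, height in people]
print(categories)
```

The two original columns are replaced by a single derived column that can be segmented and compared like any other variable.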
Exploratory data analysis helps you, as a data analyst, to look beyond the data. It is a never-ending
process: the more you explore the data, the more insights you get. As a data analyst, you would spend
almost 80% of your time understanding the data and solving various business problems through EDA.
If you understand EDA properly, then half the battle is won.
So far in this module, you have learnt the five most crucial topics for any kind of analysis. They are as
follows:
• Understanding domain
• Understanding data and preparing it for analysis
• Univariate analysis and segmented univariate analysis
• Bivariate analysis
• Deriving new metrics from the existing data
© Copyright. upGrad Education Pvt. Ltd. All rights reserved.