EDA Notes

Uploaded by

Nishant Bilagi

Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is the first step in analysing any kind of data. Rather than a
specific set of procedures, EDA is an approach, or a philosophy, which seeks to explore the most important
and often hidden patterns in a data set. In EDA, we explore the data and try to come up with hypotheses
about it, which we can later check using hypothesis testing. Statisticians use it to take a bird’s eye view of
the data and try to make some sense of it.
In this module, the following topics are covered:
1. Univariate Analysis
• Unordered Categorical Variables
• Ordered Categorical Variables
• Quantitative Variables
• Quantitative Variables - Summary Metrics
2. Segmented Univariate Analysis
• Basis of Segmentation
• Quick way of Segmentation
• Comparison of Averages
• Comparison of Other Metrics
3. Bivariate Analysis
• Bivariate Analysis on Continuous Variables
• Business Problems involving Correlation
• Bivariate Analysis on Categorical Variables
4. Derived Metrics
• Types of Derived Metrics: Type Driven Metrics
• Types of Derived Metrics: Business Driven Metrics
• Types of Derived Metrics: Data Driven Metrics

As the term “univariate” suggests, this session deals with analysing variables one at a time. It is
important to separately understand each variable before moving on to analysing multiple variables
together.

The agenda of univariate analysis is to understand:

• Metadata description
• Data distribution plots
• Summary metrics

Given a data set, the first step is to understand what it contains. Information about a data set can be gained
simply by looking at its metadata. Metadata, in simple terms, is the data that describes each variable in
detail. Information such as the size of the data set, how and when the data set was created, what the rows
and variables represent, etc. are captured in metadata.

Types of Variables
• Ordered categorical variables - These have a natural ordering. Some examples are
Salary = High-Medium-Low
Month = Jan-Feb-Mar etc.

• Unordered categorical variables - These have no notion of high-low, more-less etc. Examples:
Type of loan taken by a person = home, personal, auto etc.
Department of a person = Sales, marketing, HR etc.

• Quantitative variables

Apart from the two types of categorical variables, the other most common type is quantitative variables.
These are simply numeric variables which can be added up, multiplied, divided etc. For example, salary,
number of bank accounts, runs scored by a batsman, the mileage of a car etc.

Distribution plots reveal interesting insights about the data. You can observe various visible patterns
in the plots and try to understand how they came to be.

Summary metrics are used to obtain a quantitative summary of the data. Not all metrics can be used
everywhere: a mean, for example, is meaningful for a quantitative variable such as salary but not for an
unordered categorical variable such as type of loan. Thus, it is important to understand the data and
then choose what metric to use to summarise it.
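As a minimal sketch of choosing a summary metric by variable type, the snippet below counts categories for an unordered categorical variable and compares the mean and median of a quantitative one. The lists `loan_types` and `salaries` and their values are hypothetical, made up purely for illustration.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data: neither the names nor the values come from the notes.
loan_types = ["home", "personal", "auto", "home", "home", "personal"]
salaries = [30000, 32000, 35000, 31000, 33000, 250000]

# Unordered categorical variable: a frequency count is the natural summary.
loan_counts = Counter(loan_types)

# Quantitative variable: the mean is pulled up by the one very large salary,
# while the median stays close to the typical value.
salary_mean = mean(salaries)
salary_median = median(salaries)

print(loan_counts.most_common())
print("mean:", salary_mean, "median:", salary_median)
```

With data like this, the gap between mean and median is itself a hint that the distribution is skewed and worth plotting.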

Segmented Univariate Analysis

The broad agenda of “Segmented Univariate Analysis” is as follows:
• Basis of segmentation
• Comparison of averages
• Comparison of other metrics

Basis of segmentation
The entire segmentation process can be divided into four parts:
• Take raw data
• Group by dimensions
• Summarise using a relevant metric such as mean, median, etc.
• Compare the aggregated metric across groups/categories
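The four parts above can be sketched in a few lines of plain Python. The rows, the `education` dimension and the income figures are hypothetical; the choice of median as the metric is also only an assumption for this example.

```python
from collections import defaultdict
from statistics import median

# Part 1: take raw data (hypothetical rows of (education level, income)).
rows = [
    ("graduate", 52000), ("graduate", 61000), ("graduate", 58000),
    ("school", 31000), ("school", 29000), ("school", 40000),
]

# Part 2: group by the dimension.
groups = defaultdict(list)
for education, income in rows:
    groups[education].append(income)

# Part 3: summarise each group using a relevant metric (median here).
median_income = {level: median(values) for level, values in groups.items()}

# Part 4: compare the aggregated metric across groups, highest first.
ranked = sorted(median_income.items(), key=lambda item: item[1], reverse=True)
for level, value in ranked:
    print(level, value)
```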

Comparison of Averages

Don’t blindly believe in the averages of the buckets. You need to observe the distribution of each
bucket closely and ask yourself if the difference in means is significant enough to draw a conclusion.
If the difference in means is small, you may not be able to draw inferences. In such cases, a technique
called hypothesis testing is used to ascertain whether the difference in means is significant or due to
‘randomness’. Don’t worry if the concept of hypothesis testing is not yet clear; it is dealt with
separately in the hypothesis testing module.
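To illustrate why a bucket's average should be read alongside its distribution, here is a small sketch with two hypothetical buckets whose means are almost equal but whose medians and spreads are very different. The bucket names and values are invented for the example.

```python
from statistics import mean, median, stdev

# Hypothetical monthly spend for two buckets of customers.
buckets = {
    "bucket_a": [62, 64, 65, 66, 68],
    "bucket_b": [40, 45, 50, 55, 140],  # one extreme value
}

def summarise(values):
    """Return the centre and spread of one bucket."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }

summary = {name: summarise(values) for name, values in buckets.items()}

for name, stats in summary.items():
    print(name, stats)
```

Judged by the mean alone, bucket_b looks slightly higher; the median and the standard deviation show that this is driven by a single extreme value. Deciding whether such a difference is real is the job of hypothesis testing, which this sketch does not attempt.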

Bivariate Analysis
• Bivariate analysis on continuous variables

Correlation is a metric that measures the strength and direction of the linear relationship between two
variables. It is a number between -1 and 1 which quantifies the extent to which two variables ‘correlate’
with each other.
• If one increases as the other increases, the correlation is positive
• If one decreases as the other increases, the correlation is negative
• If there is no linear relationship between the two, the correlation is close to zero (if one variable
stays constant, the correlation is undefined, since that variable has no variance)

In general, a positive correlation means that two variables will increase together and decrease
together, e.g. an increase in rain is accompanied by an increase in humidity. A negative correlation
means that if one variable increases the other decreases, e.g. in some cases, as the price of a
commodity decreases its demand increases.

A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that the
two variables move in the same direction and all the points lie exactly on an upward-sloping straight
line. A perfect negative correlation (a coefficient of exactly -1) means that the two variables move in
opposite directions along a downward-sloping straight line, while a zero correlation implies no linear
relationship; the variables may still be related in a non-linear way.
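The following sketch computes the Pearson correlation coefficient by hand for three hypothetical cases: an exact upward line, an exact downward line, and a symmetric curve. The function name `pearson` and the sample data are assumptions made for illustration, not part of the notes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either variable is constant.
    return cov / sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
r_up = pearson(xs, [2 * x + 1 for x in xs])   # exact upward line
r_down = pearson(xs, [-3 * x for x in xs])    # exact downward line
r_curve = pearson(xs, [x * x for x in xs])    # y depends on x, but not linearly

print(r_up, r_down, r_curve)
```

The third case is worth noting: y is fully determined by x, yet the coefficient is zero because the relationship is not linear. A scatter plot would reveal the pattern that the single number hides.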

• Bivariate analysis on categorical variables

The categorical bivariate analysis is essentially an extension of the segmented univariate analysis to
another categorical variable. In segmented univariate analysis, you compare metrics such as ‘mean of X’
across various segments of a categorical variable, e.g. the mean marks of students are higher when the
mother’s education is ‘degree and above’ than at other levels; or the median income of educated parents
is higher than that of uneducated ones, etc.

In the categorical bivariate analysis, you extend this comparison to other categorical variables and
ask:
• Is this true for all categories of another variable, say, men and women?
• Taking another categorical variable, such as state: is the median income of educated parents higher
than that of uneducated ones in all states?

Thus, you are drilling down into another categorical variable and getting closer to the true patterns in
the data. In fact, you may also go to the next level and ask: is the median income of educated parents
higher than that of uneducated ones (variable 1) in all states (variable 2) for all age groups (variable 3)?
This is what you may call ‘trivariate analysis’, and though it gives you a more granular version of the
truth, it gets a bit complex to make sense of and explain to others (and hence it is not usually done in
EDA). Thus, remember that conducting only segmented univariate analysis may deceive you into
thinking that a certain phenomenon is true without asking the question: is it true for all
sub-populations, or is it true only when you aggregate information across the entire population?

So in general, there are two fundamental aspects of analysing categorical variables:

1. To see the distribution of two categorical variables. For example, if you want to compare the
number of boys and girls who play games, you can make a ‘cross table’ as given below:

From this table, firstly, you can compare boys and girls across a fixed level of ‘play games’, e.g. a
higher number of boys play games every day than girls, a higher number of girls never play games
than boys, etc. And secondly, you can compare the levels of ‘play games’ across a fixed value of
gender, e.g. most boys play every day and very few play once a month or never.
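A cross table of this kind can be built by counting pairs of values. The sketch below uses a handful of hypothetical survey responses; the counts are invented and are not the figures from the table the notes refer to.

```python
from collections import Counter

# Hypothetical responses as (gender, how often they play games).
responses = [
    ("boy", "every day"), ("boy", "every day"), ("boy", "once a week"),
    ("boy", "never"), ("girl", "every day"), ("girl", "once a week"),
    ("girl", "never"), ("girl", "never"),
]

# Each cell of the cross table is the count of one (gender, level) pair.
cell_counts = Counter(responses)
genders = sorted({gender for gender, _ in responses})
levels = ["every day", "once a week", "never"]

# One row per level of 'play games', one column per gender.
print(f"{'play games':<12}" + "".join(f"{g:>6}" for g in genders))
for level in levels:
    row = "".join(f"{cell_counts[(g, level)]:>6}" for g in genders)
    print(f"{level:<12}" + row)
```

Reading along a row compares genders at a fixed level of ‘play games’; reading down a column compares levels for a fixed gender.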

2. To see the distribution of two categorical variables with one continuous variable. For
example, you saw how a student’s percentage in science is distributed based on the father’s
occupation (categorical variable 1) and the poverty level (categorical variable 2).
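As a rough sketch of combining two categorical variables with one continuous variable, the example below computes the median income per (state, education) segment and checks whether a pattern seen in the aggregate holds in every state. All rows, state names and incomes are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows of (parent education, state, income).
rows = [
    ("educated", "State A", 60000), ("educated", "State A", 64000),
    ("uneducated", "State A", 35000), ("uneducated", "State A", 39000),
    ("educated", "State B", 41000), ("educated", "State B", 43000),
    ("uneducated", "State B", 44000), ("uneducated", "State B", 46000),
]

# Aggregate view: one categorical variable only.
overall = defaultdict(list)
# Drill-down view: two categorical variables.
by_segment = defaultdict(list)
for education, state, income in rows:
    overall[education].append(income)
    by_segment[(state, education)].append(income)

overall_median = {k: median(v) for k, v in overall.items()}
segment_median = {k: median(v) for k, v in by_segment.items()}

# Does 'educated > uneducated' hold within each state?
states = sorted({state for _, state, _ in rows})
holds_in_state = {
    s: segment_median[(s, "educated")] > segment_median[(s, "uneducated")]
    for s in states
}

print(overall_median)
print(holds_in_state)
```

In this made-up data the aggregate says educated parents earn more, but the drill-down shows the pattern does not hold in State B, which is exactly the kind of sub-population effect a single-variable segmentation can hide.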

Derived Metrics
There are three different types of derived metrics:
• Type-driven metrics
• Business-driven metrics
• Data-driven metrics

Type-driven metrics

These metrics can be derived by understanding the variable’s typology. You have already learnt one
simple way of classifying variables/attributes: categorical (ordered, unordered) and quantitative or
numeric. Similarly, there are various other ways of classification, one of which is Stevens’ typology.

Stevens’ typology classifies variables into four types (nominal, ordinal, interval and ratio):
• Nominal variables: Categorical variables, where the categories differ only by their
names; there is no order among categories, e.g. colour (red, blue, green), gender (male,
female), department (HR, analytics, sales). These are the most basic form of categorical
variables.
• Ordinal variables: Categories follow a certain order, but the mathematical difference
between categories is not meaningful, e.g. education level (primary school, high school,
college), height (tall, medium, short), performance (bad, good, excellent), etc. Ordinal
variables are nominal as well.
• Interval variables: Values follow a certain order, and the mathematical difference
between values is meaningful, but ratios are not, because the zero point is arbitrary, e.g.
temperature in degrees Celsius (the difference between 40 and 30 degrees C is meaningful, but
40 degrees C is not twice as hot as 20 degrees C), dates (the difference between two dates is the
number of days between them), etc. Interval variables are both nominal and ordinal.
• Ratio variables: Apart from the mathematical difference, the ratio
(division/multiplication) is also meaningful, because there is a true zero, e.g. sales in dollars
($100 is twice $50), marks of students (50 is half of 100), etc. Ratio variables are nominal,
ordinal and interval type.

Understanding the type of a variable enables you to derive new metrics of a different type from the same
column. For example, age in years is a ratio attribute, but you can convert it into an ordinal type by
binning it into categories such as children (< 13 years), teenagers (13-19 years), young adults (20-25
years), etc. This enables you to ask questions, e.g. do teenagers do X better than children, are young
adults more likely to do X than the other two types, etc. Here, X is an action you are interested in
studying.
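The binning described above can be sketched as a small function. The cut-offs for children, teenagers and young adults are the ones given in the text; the function name `age_group`, the assumption that ages are whole years, and the catch-all label for ages above 25 are assumptions, since the text does not define bins beyond young adults.

```python
def age_group(age_years):
    """Map an age in whole years (ratio type) to an ordinal category."""
    if age_years < 13:
        return "child"
    if age_years <= 19:
        return "teenager"
    if age_years <= 25:
        return "young adult"
    # Placeholder label: the notes do not name the bins above 25 years.
    return "older"

ages = [8, 13, 19, 22, 40]
groups = [age_group(age) for age in ages]
print(groups)
```

Once the ordinal column exists, it can be used as a segmentation dimension like any other categorical variable.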

Business Driven Metrics:

Business-driven metrics are derived from the existing variables, but deriving them requires domain
expertise. Deriving metrics from the business perspective is not an easy task. Without understanding
the domain correctly, deriving insights becomes difficult and prone to errors.

Data Driven Metrics:

Data-driven metrics can be created based on the variables present in the existing data set. For
example, suppose your data set has two variables, "weight" and "height", which show a high
correlation. Instead of analysing "weight" and "height" separately, you can derive a new metric,
"Body Mass Index (BMI)", which is the weight in kilograms divided by the square of the height in
metres. Once you get the BMI, you can easily categorise people based on their fitness, e.g. by
standard norms, a BMI below 18.5 is considered underweight, while a BMI above 30.0 is considered
obese. This is how data-driven metrics can help you discover hidden patterns in the data.
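A minimal sketch of deriving BMI from weight and height is shown below, using the usual definition (weight in kilograms divided by height in metres squared) and only the two thresholds quoted in the text. The function names, the sample people and the "in between" label are assumptions for illustration; the text does not name the categories between 18.5 and 30.0.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kg divided by height in metres squared."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Categorise using only the two thresholds mentioned in the notes."""
    if value < 18.5:
        return "underweight"
    if value > 30.0:
        return "obese"
    # Placeholder label: the notes do not subdivide this range.
    return "in between"

# Hypothetical people as (weight in kg, height in m).
people = [(50, 2.0), (81, 1.8), (100, 1.6)]
categories = [bmi_category(bmi(w, h)) for w, h in people]
print(categories)
```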

Exploratory data analysis helps you, as a data analyst, to look beyond the data. It is a never-ending
process: the more you explore the data, the more insights you get. As a data analyst, you would spend
almost 80% of your time understanding the data and solving various business problems through EDA.
If you understand EDA properly, then half the battle is won.

So far in this module, you have learnt the five most crucial topics for any kind of analysis. They are as
follows:
• Understanding domain
• Understanding data and preparing it for analysis
• Univariate analysis and segmented univariate analysis
• Bivariate analysis
• Deriving new metrics from the existing data
Disclaimer: All content and material on the upGrad website is copyrighted material, either
belonging to upGrad or its bona fide contributors and is purely for the dissemination of education.
You are permitted to access, print, and download extracts from this site purely for your own
education only and on the following basis:

• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disk or to any other storage medium,
may only be used for subsequent, self-viewing purposes or to print an individual extract
or copy for non-commercial personal use only.
• Any further dissemination, distribution, reproduction or copying of the content of the
document herein or the uploading thereof on other websites, or use of the content for any
other commercial/unauthorized purposes in any way which could infringe the intellectual
property rights of upGrad or its contributors, is strictly prohibited.
• No graphics, images, or photographs from any accompanying text in this document will
be used separately for unauthorized purposes.
• No material in this document will be modified, adapted, or altered in any way.
• No part of this document or upGrad content may be reproduced or stored on any other
website or included in any public or private electronic retrieval system or service without
prior written permission from upGrad. Any rights not expressly granted in these terms are
reserved.

© Copyright. upGrad Education Pvt. Ltd. All rights reserved.
