Understanding Exploratory Data Analysis

Uploaded by

anila

Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves analyzing data to understand its key
characteristics, patterns, and relationships through various methods.

Why Is Exploratory Data Analysis Important?

● It helps you understand the dataset.
● It helps identify hidden patterns and relationships between
different data points, which helps in model building.
● It lets you spot errors or unusual data points (outliers) that could
affect your results.
● Insights obtained from EDA help you decide which features
are most important for building models and how to prepare them to
improve performance.
● By building an understanding of the data, EDA helps in choosing the best
modeling techniques and adjusting them for better results.

TYPES OF EXPLORATORY DATA ANALYSIS:

1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis,
because it uses just one variable to study the data. The usual goal of
univariate non-graphical EDA is to understand the underlying sample
distribution and to make observations about the population. Outlier detection
is also part of the analysis. The characteristics of the population
distribution include:
● Central tendency
● Spread
● Skewness and kurtosis

2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are
generally used to show the relationship between two or more variables in
the form of either cross-tabulation or statistics. This can include techniques
like regression analysis or principal component analysis.

3. Univariate graphical: This involves creating charts and graphs to explore a
single variable. This can help you understand the distribution of the data and
identify any outliers. Common types of univariate graphics are:

● Histograms
● Stem-and-leaf plots
● Boxplots
● Quantile-normal plots

4. Multivariate graphical: Multivariate graphical EDA uses graphics to
display relationships between two or more sets of data.

Common types of multivariate graphics are:

● Grouped bar plot
● Scatterplot
● Run chart
● Heat map
● Bubble chart

Univariate data:

Univariate data refers to a type of data in which each observation or data
point corresponds to a single variable. Analyzing univariate data is the
simplest form of analysis in statistics.

Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186

Suppose that the heights of seven students in a class are recorded (table
above). There is only one variable, height, and the data does not deal with
any cause or relationship.

Key points in Univariate analysis:

1. No Relationships: Univariate analysis focuses solely on describing
and summarizing the distribution of the single variable. It does not
explore relationships between variables or attempt to identify
causes.
2. Descriptive Statistics: Descriptive statistics, such as measures of
central tendency (mean, median, mode) and measures of dispersion
(range, standard deviation), are commonly used in the analysis of
univariate data.
3. Visualization: Histograms, box plots, and other graphical
representations are often used to visually represent the distribution
of the single variable.
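
As a minimal sketch of the descriptive statistics mentioned above, the heights example can be summarized with Python's standard `statistics` module. The variable names (`heights`, `summary`) are illustrative choices, and the population standard deviation is used here as an assumption; a sample standard deviation would be equally reasonable.

```python
import statistics

# Heights (in cm) of the seven students from the example above.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]

# One variable only, so every number below describes the same distribution.
summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "range": max(heights) - min(heights),
    "std_dev": statistics.pstdev(heights),  # population standard deviation (assumed)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Nothing here relates height to any other variable, which is what makes the analysis univariate.
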
Multivariate data

Multivariate data refers to datasets where each observation or sample point
consists of multiple variables or features.
For example, suppose an advertiser wants to compare the
popularity of four advertisements on a website.

Advertisement   Gender   Click rate
Ad1             Male     80
Ad3             Female   55
Ad2             Female   123
Ad1             Male     66
Ad3             Male     35

The click rates could be measured for both men and women and
relationships between variables can then be examined. It is similar to
bivariate but contains more than one dependent variable.
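
One simple way to examine such relationships is a cross-tabulation of click rate by advertisement and gender. The sketch below uses only the rows shown in the table; the names `rows`, `groups`, and `mean_clicks` are hypothetical, and averaging repeated pairs is an assumption about how duplicates should be combined.

```python
from collections import defaultdict

# Rows copied from the table above: (advertisement, gender, click rate).
rows = [
    ("Ad1", "Male", 80),
    ("Ad3", "Female", 55),
    ("Ad2", "Female", 123),
    ("Ad1", "Male", 66),
    ("Ad3", "Male", 35),
]

# Collect the click rates observed for every (advertisement, gender) pair.
groups = defaultdict(list)
for ad, gender, clicks in rows:
    groups[(ad, gender)].append(clicks)

# Average click rate per pair (assumed way of combining repeated pairs).
mean_clicks = {key: sum(vals) / len(vals) for key, vals in groups.items()}

for (ad, gender), value in sorted(mean_clicks.items()):
    print(f"{ad:4} {gender:7} {value:6.1f}")
```
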

Key points in Multivariate analysis:

1. Analysis Techniques: The way to analyze this data
depends on the goals to be achieved. Some of the techniques are
regression analysis, principal component analysis, path analysis,
factor analysis and multivariate analysis of variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the
specific goals of the study.
3. Interpretation: Multivariate analysis helps uncover patterns that may
not be apparent when examining variables individually.

Methods of EDA

1-Descriptive Statistics
● Definition: Summarizes the main features of a data set.
● Purpose: To provide a quick overview of the data.
● Techniques:
● Measures of central tendency (mean, median, mode).
● Measures of dispersion (range, variance, standard deviation).
● Frequency distributions.

2-Data visualization

● Definition: Uses visual tools to explore data.

● Purpose: To identify patterns, trends, and data anomalies through
visualization.
● Techniques:
● Charts (bar charts, histograms, pie charts).
● Plots (scatter plots, line plots, box plots).
● Advanced visualizations (heatmaps, violin plots, pair plots).
Descriptive statistics

Various descriptive statistical methods are listed below:

● Measures of central tendencies
● Dispersion
● Skewness and kurtosis

Measures of central tendencies

Measures of central tendency are numerical values used to represent a large
collection of numerical data by a single typical value. These values are
called central or average values.

Some commonly used measures of central tendency are:

● Mean
● Median
● Mode

Mean
The mean represents the average value of the dataset. It can be calculated as
the sum of all the values in the dataset divided by the number of values.
Mean = (Sum of all the observations/Total number of observations)

Example: Find the mean of 2, 4, 6, 8, and 10.

Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (total number of observations).
Mean = 30/5 = 6
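
The same calculation can be sketched in Python, first by applying the formula directly and then with the standard library's `statistics.mean`. The variable names are illustrative only.

```python
import statistics

values = [2, 4, 6, 8, 10]

# Mean = sum of all the observations / total number of observations.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = statistics.mean(values)

print(manual_mean, library_mean)
```
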
Median
The median is the middle value of the dataset once the values have been
arranged in ascending or descending order. When the dataset contains an
even number of values, the median is found by taking the mean of the two
middle values.

Consider the given dataset with the odd number of observations arranged in
descending order – 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2

Here 12 is the middle or median number that has 6 values above it and 6 values
below it.

Now, consider another example with an even number of observations that are
arranged in descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22,
19, and 17
In this dataset, the two middle values are 29 and 27. Now, find the mean
of these two numbers:

(29 + 27) / 2 = 28

Therefore, the median for the given data distribution is 28.
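
The two cases (odd and even number of observations) can be sketched as a small function. The helper name `median` and the variable names are hypothetical; Python's own `statistics.median` follows the same rule.

```python
import statistics

odd_data = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even_data = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median(odd_data), median(even_data))

# Cross-check against the standard library.
print(statistics.median(odd_data), statistics.median(even_data))
```
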

Mode
The mode is the most frequently occurring value in the dataset.
A dataset may contain multiple modes, and in some cases it
does not contain any mode at all.

Consider the given dataset 5, 4, 2, 3, 2, 1, 5, 4, 5
Here 5 occurs three times, more often than any other value, so the mode is 5.
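
A minimal sketch of finding the mode by counting occurrences is shown below. The helper `modes` is hypothetical, and returning an empty list when no value repeats is an assumed convention for the "no mode" case, not a standard-library behavior.

```python
from collections import Counter

def modes(values):
    """Return every most-frequent value, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: nothing repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

data = [5, 4, 2, 3, 2, 1, 5, 4, 5]
print(modes(data))          # single mode
print(modes([1, 1, 2, 2]))  # two modes
print(modes([1, 2, 3]))     # no mode
```
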

Dispersion
Dispersion in statistics describes how spread out or scattered the
data is around an average value. It helps to understand whether the data
points are close together or far apart, and so shows the variability or
consistency in a set of data. There are different measures of dispersion,
such as range, variance, and standard deviation.
Measures of dispersion that are expressed in the same units as the data
itself (for example meters, dollars, or kg) are called absolute measures of
dispersion.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in
the distribution.

Range = Highest Value - Lowest Value

Example
Find the range of the data 2, 7, 11, 12, 19, 22, 25, 27, 33, 35
Highest Value = 35
Lowest Value = 2
Range = Highest Value - Lowest Value = 35 - 2 = 33
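
The range example above can be checked with a two-line sketch; the variable names are illustrative.

```python
data = [2, 7, 11, 12, 19, 22, 25, 27, 33, 35]

# Range = highest value - lowest value.
data_range = max(data) - min(data)

print(data_range)
```
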

Mean Deviation: It is the arithmetic mean of the absolute differences between
the values and their mean.

Mean Deviation = (Sum of |each observation - mean|) / (Total number of observations)

Standard Deviation: It is the square root of the arithmetic mean of the squared
deviations measured from the mean.
Variance: It is defined as the average of the squared deviations from the mean of the
given data set.

Suppose we have the data set {3, 5, 8, 1} and we want to find the population variance.
The mean is (3 + 5 + 8 + 1) / 4 = 4.25. Then, by the definition of variance, we get
[(3 - 4.25)^2 + (5 - 4.25)^2 + (8 - 4.25)^2 + (1 - 4.25)^2] / 4 = 26.75 / 4 = 6.6875.
Thus, variance ≈ 6.69.
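
The definitions of mean deviation, variance, and standard deviation can be sketched directly on the data set {3, 5, 8, 1}. This is a minimal illustration using population formulas (dividing by n, as in the worked example); the variable names are hypothetical.

```python
import math
import statistics

data = [3, 5, 8, 1]
n = len(data)
mean = sum(data) / n

# Mean deviation: average absolute distance from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / n

# Population variance: average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance.
std_deviation = math.sqrt(variance)

print(mean, mean_deviation, variance, round(std_deviation, 3))

# Cross-check against the standard library's population variance.
print(statistics.pvariance(data))
```
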

Quartile Deviation: It is defined as half of the difference between the third
quartile and the first quartile in a given data set.
Interquartile Range: The difference between the upper quartile (Q3) and the
lower quartile (Q1) is called the interquartile range.

Interquartile range = Upper Quartile (Q3) - Lower Quartile (Q1)
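
As a sketch, the quartiles can be obtained with `statistics.quantiles` (Python 3.8+). Several conventions exist for locating quartiles; the `"inclusive"` method used here is an assumption, and other conventions can give slightly different values. The data set is the one from the range example.

```python
import statistics

data = [2, 7, 11, 12, 19, 22, 25, 27, 33, 35]

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

interquartile_range = q3 - q1
quartile_deviation = interquartile_range / 2

print(q1, q3, interquartile_range, quartile_deviation)
```
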

Relative Measure of Dispersion

Relative measures of dispersion are unit-free ratios. We use them to compare the
scattering of two data sets that are measured in different units.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the
highest and lowest value in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the
mean of the data set. It is usually expressed as a percentage.
Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to
the value of the central point of the data set.
Coefficient of Quartile Deviation: It is defined as the ratio of the difference
between the third quartile and the first quartile to the sum of the third and first
quartiles.
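
The four coefficients can be sketched as follows, reusing the data set from the range example. Two choices here are assumptions rather than part of the definitions above: the mean is taken as the "central point" for the coefficient of mean deviation, and quartiles are computed with the `"inclusive"` method of `statistics.quantiles`.

```python
import statistics

data = [2, 7, 11, 12, 19, 22, 25, 27, 33, 35]

mean = statistics.mean(data)
std_dev = statistics.pstdev(data)  # population standard deviation
mean_dev = sum(abs(x - mean) for x in data) / len(data)
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Each coefficient is a ratio, so the units of the data cancel out.
coeff_range = (max(data) - min(data)) / (max(data) + min(data))
coeff_variation = std_dev / mean * 100  # expressed as a percentage
coeff_mean_dev = mean_dev / mean        # central point assumed to be the mean
coeff_quartile_dev = (q3 - q1) / (q3 + q1)

print(round(coeff_range, 3), round(coeff_variation, 1),
      round(coeff_mean_dev, 3), round(coeff_quartile_dev, 3))
```
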

Skewness and Kurtosis

Skewness measures the degree of asymmetry in a distribution, that is, how far
the shape of the data departs from a symmetric distribution such as the normal
distribution.

Types of Skewness

There are three types of skewness: positive, negative, and zero skewness.

A distribution with zero skewness has the following characteristics:

● Symmetric distribution with values evenly centered around the mean.
● No skew, lean, or tail to either side.
● The mean, median, and mode are all at the center point.

In practice, the mean, median, and mode may not coincide exactly. They may
be slightly apart, but the difference is too small to matter.

Positive skewness (right-skewed):

● The right tail of the distribution is longer or fatter than the left.
● The mean is greater than the median, and the mode is less than both
mean and median.
● Lower values are clustered in the “hill” of the distribution, while extreme
values are in the long right tail.
● It is also known as right-skewed distribution.

Negative skewness (left-skewed):

● The left tail of the distribution is longer or fatter than the right.
● The mean is less than the median, and the mode is greater than both
mean and median.
● Higher values are clustered in the “hill” of the distribution, while extreme
values are in the long left tail.
● It is also known as left-skewed distribution.
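
The three cases can be illustrated with a small sketch. The text above does not give a formula, so the moment-based coefficient used here (third central moment divided by the 1.5 power of the second) is an assumption; it is one common definition among several. The function name and the sample data are hypothetical, and the function does not handle constant data, where the denominator would be zero.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 10]       # long tail toward high values
left_skewed = [-x for x in right_skewed]    # mirror image: long tail toward low values

for name, data in [("symmetric", symmetric),
                   ("right-skewed", right_skewed),
                   ("left-skewed", left_skewed)]:
    print(name, round(skewness(data), 3),
          "mean:", round(statistics.mean(data), 2),
          "median:", statistics.median(data))
```

In the right-skewed sample the mean exceeds the median, and in the left-skewed sample it falls below it, matching the bullet points above.
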

Kurtosis

While skewness describes the asymmetry of a distribution's tails, kurtosis
focuses more on the height. It tells us how peaked or flat a distribution is
compared with a normal distribution.

High kurtosis indicates:

● Sharp peakedness in the distribution’s center.
● More values concentrated around the mean than in a normal distribution.
● Heavier tails because of a higher concentration of extreme values or
outliers in tails.
● Greater likelihood of extreme events.

On the other hand, low kurtosis indicates:

● Flat peak.
● Fewer values concentrated around the mean than in a normal
distribution.
● Lighter tails.
● Lower likelihood of extreme events.

Depending on the degree, distributions have three types of kurtosis:

1. Mesokurtic distribution (kurtosis = 3, excess kurtosis = 0): perfect
normal distribution or very close to it.
2. Leptokurtic distribution (kurtosis > 3, excess kurtosis > 0): sharp
peak, heavy tails
3. Platykurtic distribution (kurtosis < 3, excess kurtosis < 0): flat peak,
light tails
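
The classification above can be sketched with a moment-based kurtosis (fourth central moment divided by the square of the second), which equals 3 for a normal distribution. The function names, the sample data, and the `tolerance` used to decide what counts as "very close to" mesokurtic are all hypothetical choices for illustration; constant data is not handled.

```python
def kurtosis(values):
    """Moment-based kurtosis m4 / m2**2 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def classify(values, tolerance=0.1):
    """Label a sample by its excess kurtosis (kurtosis - 3)."""
    excess = kurtosis(values) - 3
    if abs(excess) <= tolerance:  # tolerance is an assumed cutoff
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"

flat = [1, 2, 3, 4, 5]              # evenly spread values, light tails
heavy_tailed = [0] * 8 + [-10, 10]  # most values at the center, two extremes

for name, data in [("flat", flat), ("heavy-tailed", heavy_tailed)]:
    k = kurtosis(data)
    print(name, "kurtosis:", round(k, 2),
          "excess:", round(k - 3, 2), classify(data))
```
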
