Understanding Exploratory Data Analysis

Uploaded by

anila
Copyright
© All Rights Reserved

Exploratory Data Analysis

Exploratory Data Analysis (EDA) involves analyzing data to understand its key
characteristics, patterns, and relationships through various methods.

Why is Exploratory Data Analysis Important?

● Helps to understand the dataset.
● Helps to identify hidden patterns and relationships between
different data points, which supports model building.
● Makes it possible to spot errors or unusual data points (outliers)
that could affect your results.
● The insights obtained from EDA help you decide which features
are most important for building models and how to prepare them to
improve performance.
● By clarifying the structure of the data, EDA helps in choosing the
best modeling techniques and adjusting them for better results.

TYPES OF EXPLORATORY DATA ANALYSIS:

1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis,
because only one variable is examined. The main goal of univariate
non-graphical EDA is to understand the distribution of the sample data and
to make observations about the population. Outlier detection is also part
of the analysis. The characteristics of a population distribution include:
● Central tendency: the typical or middle value, measured by the mean,
median, or mode.
● Spread: how far the values lie from the center, measured by the range,
variance, or standard deviation.
● Skewness and kurtosis: the asymmetry of the distribution and the
heaviness of its tails.

2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are
usually used to show the relationship between two or more variables in the
form of either cross-tabulation or summary statistics. This can include
techniques like regression analysis or principal component analysis.

3. Univariate graphical: This involves creating charts and graphs to explore
a single variable. It can help you understand the distribution of the data
and identify any outliers. Common types of univariate graphics are:

● Histogram
● Stem-and-leaf plot
● Boxplot
● Quantile-normal plot

4. Multivariate graphical: Multivariate graphical EDA uses graphics to
display relationships between two or more sets of data.

Common types of multivariate graphics are:

● Grouped barplot
● Scatterplot
● Run chart
● Heat map
● Bubble chart

Univariate data

Univariate data refers to a type of data in which each observation or data
point corresponds to a single variable. Analyzing univariate data is the
simplest form of analysis in statistics.

Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186

Suppose that the heights of seven students in a class are recorded (table
above). There is only one variable, height, and the data does not deal with
any cause or relationship.

Key points in Univariate analysis:

1. No Relationships: Univariate analysis focuses solely on describing
and summarizing the distribution of the single variable. It does not
explore relationships between variables or attempt to identify
causes.
2.​ Descriptive Statistics: Descriptive statistics, such as measures of
central tendency (mean, median, mode) and measures of dispersion
(range, standard deviation), are commonly used in the analysis of
univariate data.
3.​ Visualization: Histograms, box plots, and other graphical
representations are often used to visually represent the distribution
of the single variable.
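
As a rough illustration of the descriptive statistics mentioned above, the
following Python sketch summarizes the seven heights from the table using only
the standard library. The variable names are chosen for this example, and the
population standard deviation is an assumption (the seven students are treated
as the whole group of interest).

```python
import statistics

# Heights (in cm) of the seven students from the table above.
heights = [164, 167.3, 170, 174.2, 178, 180, 186]

summary = {
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "range": max(heights) - min(heights),
    # Assumption: population formula (divide by n), not the sample formula.
    "std_dev": statistics.pstdev(heights),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Each number describes the single variable on its own; nothing here relates
height to any other variable.
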

Multivariate data

Multivariate data refers to datasets where each observation or sample point
consists of multiple variables or features.
For example, suppose an advertiser wants to compare the popularity of four
advertisements on a website.

Advertisement   Gender   Click rate
Ad1             Male     80
Ad3             Female   55
Ad2             Female   123
Ad1             Male     66
Ad3             Male     35

The click rates can be measured for both men and women, and the
relationships between the variables can then be examined. Multivariate data
is similar to bivariate data but contains more than one dependent variable.
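
One simple way to start examining such relationships is to group the rows by
one variable and summarize another. The sketch below averages the click rate
per advertisement and per gender for the five rows in the table. The helper
name `mean_click_rate_by` is hypothetical and chosen only for this example.

```python
from collections import defaultdict
from statistics import mean

# Rows from the table above: (advertisement, gender, click rate).
rows = [
    ("Ad1", "Male", 80),
    ("Ad3", "Female", 55),
    ("Ad2", "Female", 123),
    ("Ad1", "Male", 66),
    ("Ad3", "Male", 35),
]


def mean_click_rate_by(column_index):
    """Group the rows by one column and average the click rate per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column_index]].append(row[2])
    return {key: mean(values) for key, values in groups.items()}


by_ad = mean_click_rate_by(0)      # grouped by advertisement
by_gender = mean_click_rate_by(1)  # grouped by gender
print(by_ad)
print(by_gender)
```

With only five rows the averages are not reliable estimates; the point is the
grouping pattern, which is the basis of a cross-tabulation.
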

Key points in Multivariate analysis:


1. Analysis Techniques: The way to analyze this data depends on the
goals to be achieved. Common techniques include regression analysis,
principal component analysis, path analysis, factor analysis and
multivariate analysis of variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique depends on the
specific goals of the study.
3. Interpretation: Multivariate analysis helps uncover patterns that may
not be apparent when examining variables individually.

Methods of EDA

1-Descriptive Statistics
● Definition: Summarizes the main features of a data set.
● Purpose: To provide a quick overview of the data.
● Techniques:
    ● Measures of central tendency (mean, median, mode).
    ● Measures of dispersion (range, variance, standard deviation).
    ● Frequency distributions.

2-Data visualization

● Definition: Uses visual tools to explore data.
● Purpose: To identify patterns, trends, and data anomalies through
visualization.
● Techniques:
    ● Charts (bar charts, histograms, pie charts).
    ● Plots (scatter plots, line plots, box plots).
    ● Advanced visualizations (heatmaps, violin plots, pair plots).

Descriptive statistics

Various descriptive statistical methods are listed below:

● Measures of central tendency
● Dispersion
● Skewness and kurtosis
Measures of central tendency

Measures of central tendency are numerical values that represent a large
collection of numerical data with a single typical value. These values are
called central or average values.

Some commonly used measures of central tendency are:

● Mean
● Median
● Mode

Mean
The mean represents the average value of the dataset. It can be calculated as
the sum of all the values in the dataset divided by the number of values.
Mean = (Sum of all the observations/Total number of observations)

Example: Find the mean of 2, 4, 6, 8 and 10.

Solution:
First, add all the numbers.
2 + 4 + 6 + 8 + 10 = 30
Now divide by 5 (the total number of observations).
Mean = 30/5 = 6

Median
The median is the middle value of the dataset when the values are arranged
in ascending or descending order. When the dataset contains an even number
of values, the median is found by taking the mean of the two middle values.

Consider the given dataset with the odd number of observations arranged in
descending order – 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2

Here 12 is the middle or median number that has 6 values above it and 6 values
below it.

Now, consider another example with an even number of observations that are
arranged in descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22,
19, and 17
When you look at the given dataset, the two middle values are 27 and 29.
Now find the mean of these two numbers:

(27 + 29)/2 = 28

Therefore, the median of the given dataset is 28.
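
The two cases above (odd and even number of observations) can be sketched in
Python as follows. The function name `median` is chosen for this example; the
standard library's `statistics.median` applies the same rule.

```python
def median(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd = [23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, 2]
even = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]

print(median(odd))   # 12
print(median(even))  # 28.0
```
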

Mode
The mode is the most frequently occurring value in the dataset. A dataset
may contain multiple modes, and in some cases it does not contain any mode
at all.

Consider the dataset 5, 4, 2, 3, 2, 1, 5, 4, 5. The value 5 occurs three
times, more often than any other value, so the mode is 5.
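
A minimal Python sketch of this idea is shown below. The helper `modes` is
hypothetical: it returns every value that shares the highest count, and it
assumes that a non-empty dataset in which every value occurs exactly once has
no mode.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: treated as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([5, 4, 2, 3, 2, 1, 5, 4, 5]))  # [5]
print(modes([1, 1, 2, 2, 3]))              # [1, 2]
print(modes([1, 2, 3]))                    # []
```
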


Dispersion
Dispersion in statistics describes how spread out or scattered the data is
around an average value. It shows whether the data points are close together
or far apart, and therefore how variable or consistent the data is. Common
measures of dispersion include the range, variance, and standard deviation.

Absolute Measure of Dispersion

Measures of dispersion that are expressed in the units of the data itself
(for example meters, dollars, or kg) are called absolute measures of
dispersion.
Some absolute measures of dispersion are:
Range: It is defined as the difference between the largest and the smallest value in
the distribution.

Range = Highest Value - Lowest Value


Example
Find the range of the data 2, 7, 11, 12, 19, 22, 25, 27, 33, 35
Highest Value = 35
Lowest Value = 2
Range = Highest Value - Lowest Value = 35 - 2 = 33

Mean Deviation: It is the arithmetic mean of the absolute differences between
the values and their mean.

Mean Deviation = (Sum of |observation - mean|) / Total number of observations
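
As a small worked sketch, the mean deviation about the mean can be computed in
Python as below. The data 2, 4, 6, 8, 10 is reused from the mean example, and
the function name is chosen for illustration.

```python
def mean_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


data = [2, 4, 6, 8, 10]
# Mean is 6; absolute deviations are 4, 2, 0, 2, 4, which sum to 12.
print(mean_deviation(data))  # 2.4
```
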

Standard Deviation: It is the square root of the arithmetic average of the
squared deviations measured from the mean, that is, the square root of the
variance.
Variance: It is defined as the average of the squared deviations from the mean
of the given data set.

Suppose we have the data set {3, 5, 8, 1} and we want to find the population
variance. The mean is (3 + 5 + 8 + 1) / 4 = 4.25. Then, by the definition of
variance, we get
[(3 - 4.25)^2 + (5 - 4.25)^2 + (8 - 4.25)^2 + (1 - 4.25)^2] / 4
= 26.75 / 4 = 6.6875. Thus, variance ≈ 6.69.
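
The same calculation can be checked with a short Python sketch. It assumes the
population formulas (dividing by n), as in the example above; the variable
names are chosen for illustration.

```python
import math

data = [3, 5, 8, 1]

mean = sum(data) / len(data)                               # 4.25
variance = sum((x - mean) ** 2 for x in data) / len(data)  # 6.6875
std_dev = math.sqrt(variance)                              # about 2.586

print(mean, variance, round(std_dev, 3))
```

For a sample rather than a whole population, the sum of squared deviations is
usually divided by n - 1 instead of n.
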

Quartile Deviation: It is defined as half of the difference between the third
quartile and the first quartile in a given data set.

Quartile deviation = (Q3 – Q1) / 2

Interquartile Range: The difference between the upper quartile (Q3) and the
lower quartile (Q1) is called the interquartile range.

Interquartile range = Upper Quartile (Q3) – Lower Quartile (Q1)
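
A possible way to compute these two measures in Python is sketched below,
reusing the data from the range example. Several conventions exist for locating
quartiles; this sketch assumes the "inclusive" method of the standard library's
`statistics.quantiles`, so other tools or textbooks may give slightly different
values for Q1 and Q3.

```python
import statistics

data = [2, 7, 11, 12, 19, 22, 25, 27, 33, 35]

# n=4 splits the data into quartiles and returns the cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                 # interquartile range
quartile_deviation = iqr / 2  # half of the interquartile range

print(q1, q3, iqr, quartile_deviation)
```
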

Relative Measure of Dispersion

Relative measures of dispersion are unitless ratios, so they can be used to
compare the scatter of two datasets that are measured in different units.
Here are some of the relative measures of dispersion:
Coefficient of Range: It is defined as the ratio of the difference between the
highest and lowest value in a data set to the sum of the highest and lowest value.
Coefficient of Variation: It is defined as the ratio of the standard deviation to the
mean of the data set. We use percentages to express the coefficient of variation.
Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to
the value of the central point of the data set.
Coefficient of Quartile Deviation: It is defined as the ratio of the difference
between the third quartile and the first quartile to the sum of the third and first
quartiles.
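
The four coefficients defined above can be sketched in Python as follows. This
is an illustration under stated assumptions: the population standard deviation
is used, the mean is taken as the "central point" for the coefficient of mean
deviation, and quartiles use the "inclusive" method of `statistics.quantiles`.

```python
import statistics

data = [2, 7, 11, 12, 19, 22, 25, 27, 33, 35]

high, low = max(data), min(data)
mean = statistics.mean(data)

# Coefficient of range: (highest - lowest) / (highest + lowest) = 33 / 37.
coefficient_of_range = (high - low) / (high + low)

# Coefficient of variation, expressed as a percentage.
coefficient_of_variation = statistics.pstdev(data) / mean * 100

# Coefficient of mean deviation, taking the mean as the central point.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
coefficient_of_mean_deviation = mean_deviation / mean

# Coefficient of quartile deviation: (Q3 - Q1) / (Q3 + Q1).
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
coefficient_of_quartile_deviation = (q3 - q1) / (q3 + q1)

print(round(coefficient_of_range, 3))
print(round(coefficient_of_variation, 1))
print(round(coefficient_of_mean_deviation, 3))
print(round(coefficient_of_quartile_deviation, 3))
```
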

Skewness and Kurtosis


Skewness measures the level of asymmetry in a distribution. A distribution is
skewed when its values are not spread symmetrically around the center, as they
are in a normal distribution.

Types of Skewness

There are three types of skewness: positive, negative, and zero skewness.

A distribution with zero skewness has the following characteristics:

● Symmetric distribution with values evenly centered around the mean.
● No skew, lean or tail to either side.
● The mean, median, and mode are all at the center point.

In practice, the mean, median, and mode may not coincide exactly. They may
be slightly apart, but the difference is too small to matter.

Positive skewness (right-skewed):

● The right tail of the distribution is longer or fatter than the left.
● The mean is typically greater than the median, and the mode is less than
both mean and median.
● Lower values are clustered in the “hill” of the distribution, while extreme
values are in the long right tail.
● It is also known as a right-skewed distribution.

Negative skewness (left-skewed):

● The left tail of the distribution is longer or fatter than the right.
● The mean is typically less than the median, and the mode is greater than
both mean and median.
● Higher values are clustered in the “hill” of the distribution, while extreme
values are in the long left tail.
● It is also known as a left-skewed distribution.
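
The three cases can be illustrated with a small Python sketch. The text above
does not give a formula, so this example assumes one common definition, the
moment-based coefficient (third central moment divided by the second central
moment raised to the power 1.5). The three datasets are made up for
illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 4, 5, 12]   # long right tail: mode < median < mean
left_skewed = [13, 12, 12, 11, 10, 9, 2]  # mirror image: mean < median < mode

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # positive
print(skewness(left_skewed))   # negative
```
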

Kurtosis

While skewness describes the asymmetry of a distribution, kurtosis describes
its peak and tails. It tells us how peaked or flat a distribution is compared
with the normal distribution.

High kurtosis indicates:

● Sharp peakedness in the distribution’s center.
● More values concentrated around the mean than in a normal distribution.
● Heavier tails because of a higher concentration of extreme values or
outliers in the tails.
● Greater likelihood of extreme events.

On the other hand, low kurtosis indicates:


● Flat peak.
● Fewer values concentrated tightly around the mean than in a normal
distribution.
● Lighter tails.
● Lower likelihood of extreme events.

Depending on the degree, distributions have three types of kurtosis:


1.​ Mesokurtic distribution (kurtosis = 3, excess kurtosis = 0): perfect
normal distribution or very close to it.
2.​ Leptokurtic distribution (kurtosis > 3, excess kurtosis > 0): sharp
peak, heavy tails
3.​ Platykurtic distribution (kurtosis < 3, excess kurtosis < 0): flat peak,
light tails
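
The classification above can be sketched in Python as follows. This example
assumes the moment-based definition of kurtosis (fourth central moment divided
by the squared second central moment), which equals 3 for a normal
distribution. The `classify` helper and its `tolerance` parameter are
hypothetical, added because real data rarely gives exactly 3.

```python
def kurtosis(values):
    """Moment-based kurtosis: m4 / m2 ** 2 (equals 3 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


def classify(values, tolerance=0.1):
    """Label a dataset by its excess kurtosis (kurtosis minus 3)."""
    excess = kurtosis(values) - 3
    if abs(excess) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if excess > 0 else "platykurtic"


flat = [1, 2, 3, 4, 5]                            # evenly spread, light tails
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # sharp peak, two extremes

print(kurtosis(flat), classify(flat))
print(kurtosis(heavy_tailed), classify(heavy_tailed))
```

Here the evenly spread data has kurtosis below 3, while the data with a sharp
peak and two extreme values has kurtosis above 3.
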
