
Introduction to Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial first step in data science that involves summarizing and visualizing data to understand its main features. Originating from John W. Tukey's work in the 1960s, EDA has evolved with modern computing to handle large datasets and inform data-driven decisions. Key concepts include structured vs. unstructured data, types of data, and various statistical measures such as mean, median, and variability, which help in analyzing and interpreting data effectively.

Uploaded by

zaralightlybloum


Module 1 (Exploratory Data Analysis)

Introduction to Exploratory Data Analysis (EDA) Meaning:

• Exploratory Data Analysis (EDA) is the first and most important step in
any data science project.

• It involves summarizing, visualizing, and understanding the main

features of data before applying formal statistical or machine learning
models.

Background and Origin

• Classical statistics mainly focused on statistical inference — i.e.,

drawing conclusions about large populations from small samples
using complex mathematical procedures.

• In 1962, John W. Tukey proposed a new scientific discipline called

“Data Analysis” in his paper “The Future of Data Analysis.”

• Tukey’s idea was to include data exploration as a key component of

statistics, not just inference.

• He connected statistics with engineering and computer science, even

coining the terms:

◦ Bit for binary digit

◦ Software

Tukey’s Contribution

• In 1977, Tukey published the classic book “Exploratory Data Analysis.”

• He introduced simple yet powerful techniques to understand data:

◦ Boxplots

◦ Scatterplots

◦ Summary statistics like mean, median, and quantiles

• His approach emphasized visual understanding of data rather than

relying only on formulas.
Modern Development of EDA

• With advanced computing power and data analysis software (R,

Python, etc.), EDA has evolved beyond its original scope.

• Key drivers of modern EDA:

1. Rapid technology development

2. Availability of large and complex datasets

3. Increased use of data-driven decisions in business,

research, and engineering

David Donoho’s View

• David Donoho, a professor at Stanford and former student of

Tukey, emphasized Tukey’s lasting impact.

• In his 2015 paper, Donoho showed that modern data science traces its
origins to Tukey’s pioneering work in data analysis.

Elements of Structured Data

1. Meaning of Data

• Data comes from various sources: sensors, text, images, videos,

events, and more.

• The Internet of Things (IoT) generates continuous data streams.

• Most raw data is unstructured (e.g., images = pixels, text = words).

• For statistical analysis, unstructured data must be converted into a

structured format.

2. Structured Data

• Structured data is organized in rows and columns (like a

table or spreadsheet).
• This structure allows easy storage, retrieval, and analysis.

• Common examples: data from relational databases or CSV files.

3. Types of Structured Data

A. Numeric Data

• Represent quantities or numbers.

• Two forms:

◦ Continuous: can take any value within an interval

Example: temperature, wind speed, time duration.

◦ Discrete: take only integer or count values

Example: number of students, number of calls received.
B. Categorical Data

• Represent fixed sets of possible categories.

• Examples: gender, product type, state name.

• Subtypes:

◦ Binary Data: only two possible values (e.g., 0/1, Yes/No, True/False).

◦ Ordinal Data: categories have a meaningful order

Example: rating scale (poor, good, excellent) or numeric ranks (1, 2,
3, 4, 5).
4. Importance of Data Types

• Helps decide:

◦ Which visualization to use (e.g., histogram, bar chart).

◦ Which statistical model or algorithm to apply.

• Data science tools like R and Python use data types to:

◦ Improve computational efficiency.

◦ Control how data is handled in analysis.
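5. Handling Data in Software

In R:

• read.csv() automatically converts text columns into factors (categorical variables) in R versions before 4.0; from R 4.0 onward this happens only when stringsAsFactors = TRUE is passed.

• Assigning a new value outside the predefined categories of a factor causes a warning and produces NA (missing value).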

In Python (pandas):

• Text columns remain as text by default.

• You can explicitly define columns as categorical using:

pd.read_csv('[Link]', dtype={'column_name': 'category'})
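
The pandas behaviour described above can be sketched as follows. This is a minimal illustration, assuming a tiny in-memory CSV; the column name `rating` and its values are made up for the example and do not come from the notes.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only for illustration.
csv_text = "id,rating\n1,poor\n2,good\n3,excellent\n4,good\n"

# Default behaviour: the text column stays a plain text column.
df_default = pd.read_csv(io.StringIO(csv_text))

# Explicitly request a categorical column via the dtype argument.
df_cat = pd.read_csv(io.StringIO(csv_text), dtype={"rating": "category"})

print(df_default["rating"].dtype)
print(df_cat["rating"].dtype)
print(list(df_cat["rating"].cat.categories))
```

Note that categories created this way are unordered; an ordinal variable would additionally need an explicit category order.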

6. Key Advantages of Identifying Data Types

• Tells the software how to process and visualize data.

• Improves storage efficiency and indexing (as in databases).

• Ensures only valid values are entered (like enums).

7. Key Terms Summary

Term            Definition / Example
Numeric Data    Numbers (continuous or discrete)
Continuous      Any value in a range (e.g., time, weight)
Discrete        Counted numbers (e.g., number of students)
Categorical     Fixed set of values (e.g., colors, countries)
Binary          Two possible values (Yes/No, 0/1)
Ordinal         Ordered categories (Low, Medium, High)

Rectangular Data

1. Meaning

• Rectangular data refers to a two-dimensional table (rows × columns).

• It is the standard format used in data analysis and machine learning.

• Example: Excel sheet, CSV file, or database table.

2. Structure of Rectangular Data

• Rows → Records (observations, samples, instances)

• Columns → Variables or Features (attributes, predictors)

• Each cell contains a value for a specific record and variable.

In R and Python, this structure is called a Data Frame.

3. Common Terms

Term               Meaning / Synonyms

Data Frame         Rectangular data structure used in R and Python
Feature            A column in the data (also called variable, predictor, or input)
Outcome / Target   The variable to be predicted (also called dependent variable, response, or output)
Record             A row in the data (also called case, instance, or observation)

4. Example of a Data Frame

Category    Currency  SellerRating  Duration  EndDay  ClosePrice  OpenPrice  Competitive?

Music       US        3249          5         Mon     0.01        0.01       0
Automotive  US        3115          7         Tue     0.01        0.01       1

Explanation:

• The table includes both numeric data (e.g., price, duration) and
categorical data (e.g., category, currency).

• The “Competitive?” column is a binary variable (0 = No, 1 =

Yes) representing the outcome.

5. Data Frames and Indexes

In Python (pandas):

• Data is stored as a DataFrame object.

• Each row has an automatic index (0, 1, 2, ...).

• You can create multi-level indexes to speed up operations.

In R:

• Data is stored as a data.frame.

• Rows are indexed by position.

• Packages like data.table and dplyr provide faster data manipulation and support multi-indexing.

6. Terminology Differences

• Statisticians say:

◦ Predictor variables predict a response variable.

• Data scientists say:

◦ Features predict a target.

• Note:
◦ “Sample” means one record to a computer scientist, but a
collection of records to a statistician.

7. Non-Rectangular Data Structures

Not all data is rectangular. Some important types are:

a. Time Series Data

• Records measurements of a variable over time.

• Used in forecasting, finance, IoT devices, etc.

• Example: daily temperature readings.

b. Spatial Data

• Related to geographical or spatial locations.

• Example: house price based on location coordinates.

• Represented as either:

◦ Object view: each item has coordinates (e.g., a house on a map).

◦ Field view: data for every small region (e.g., pixel brightness).

c. Graph or Network Data

• Represents connections or relationships between entities.

• Examples:

◦ Social network (people connected by friendships).

◦ Transportation network (cities connected by roads).

• Used in network analysis and recommendation systems.

Estimates of Location

1. Meaning
• Location (Central Tendency) shows where the center or “typical” value
of a dataset lies.
• It gives an idea of what is normal or expected for a variable.

• Common measures: Mean, Median, Weighted Mean, Trimmed Mean.

2. Key Terms

Term            Meaning / Definition                                   Synonyms / Notes

Mean            Average value (sum of all values ÷ number of values)   Average
Weighted Mean   Average giving importance (weight) to each value       Weighted Average
Median          Middle value when data is sorted                       50th Percentile
Percentile      Value below which a certain % of data lies             Quantile
Trimmed Mean    Mean after removing extreme high & low values          Truncated Mean
Robust          Not affected much by outliers                          Resistant
Outlier         Unusually high or low value compared to others         Extreme Value

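3. Mean (Arithmetic Mean)

\( \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \)

Where

• \(x_i\) = individual values

• \(n\) = number of observations

Example:
Data = {3, 5, 1, 2}

Mean = (3 + 5 + 1 + 2) / 4 = 11 / 4 = 2.75

Most appropriate when the data is roughly symmetric with no extreme values, because the mean is sensitive to outliers.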

4. Trimmed Mean

• Definition: Mean computed after removing a fixed percentage of lowest

and highest values.

• Removes the effect of outliers.

Useful when data has outliers — e.g., exam scores or sports judging (diving,
gymnastics).
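
The trimming step can be sketched in a few lines of plain Python. This is an illustrative sketch, not a reference implementation: the helper name `trimmed_mean` and the score values are hypothetical, and the function drops the same fraction of values from each end after sorting.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from EACH end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # values removed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical scores; 250 is an obvious outlier.
scores = [52, 55, 58, 60, 61, 63, 65, 66, 70, 250]

plain_mean = sum(scores) / len(scores)      # 80.0, pulled up by the outlier
trimmed = trimmed_mean(scores, 0.10)        # 62.25, drops 52 and 250

print(plain_mean, trimmed)
```

With 10% trimmed from each end, one low and one high score are removed, so the single extreme value no longer dominates the result.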

5. Weighted Mean

• Definition: Mean that gives different importance (weights) to data points.

Example:

Value (x) Weight (w)

10 1
20 2
30 3

Used when data from groups or sensors have different accuracy or importance.

There are two main reasons to use a weighted mean:

1. Different accuracy: Some values are less reliable (like readings from a
less accurate sensor), so we give them less weight when finding the
average.

2. Unequal representation: Sometimes, the data doesn’t equally represent

all groups. To fix that, we give more weight to the groups that have fewer
samples so the overall result is fair.
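
As a worked check of the weighted mean, the value/weight table above can be plugged into the usual formula, weighted mean = Σ(w_i · x_i) / Σ(w_i). The snippet below is a small sketch using only the numbers from that table; the variable names are illustrative.

```python
values = [10, 20, 30]
weights = [1, 2, 3]

# Sum of weight * value, divided by the sum of the weights.
weighted_sum = sum(w * x for w, x in zip(weights, values))   # 10 + 40 + 90 = 140
weighted_mean = weighted_sum / sum(weights)                   # 140 / 6

plain_mean = sum(values) / len(values)                        # 20.0

print(round(weighted_mean, 2), plain_mean)
```

The weighted mean (about 23.33) is higher than the plain mean (20) because the largest value carries the largest weight.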
6. Median

• Definition: The middle value when data is arranged in ascending order.

• If n is odd: middle value.

• If n is even: average of two middle values.

Example:

Data = {1, 2, 4, 5, 6} → Median = 4
Data = {1, 2, 4, 5} → Median = (2 + 4) / 2 = 3

Median is robust to outliers (not affected by extreme values).

Example: Average income in a city — use median instead of mean because one
billionaire can distort the average.

7. Weighted Median

• Each value has a weight (importance).

• The median is the point where half the total weight lies below and half
above.

Used when sample sizes or group importance differ.

A weighted median works like a normal median, but it also considers the
importance (weight) of each value. We sort the data and find the point where
the total weight on both sides is the same. It helps when some data points are
more important than others and is not affected much by extreme values
(outliers).

Example:
Suppose three sensors give temperature readings:

• Sensor A: 20°C (weight 3 — very reliable)

• Sensor B: 25°C (weight 1 — less reliable)

• Sensor C: 30°C (weight 2 — moderately reliable)

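When we sort the readings by value and accumulate the weights (total weight = 6), the cumulative weight reaches exactly half of the total (3) at 20°C. The balance point therefore lies between 20°C and 25°C, and a common convention reports the midpoint, 22.5°C. Sensor B’s 25°C also qualifies as a balance point, because no more than half of the total weight lies below it (3) or above it (2).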
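
One way to compute a weighted median is sketched below. It is a minimal illustration under stated assumptions: the helper name `weighted_median` is hypothetical, weights are assumed positive, and when the cumulative weight lands exactly on half the total the function averages the two neighbouring values (other conventions exist).

```python
def weighted_median(values, weights):
    """Weighted median; averages the two neighbours on an exact 50/50 split."""
    pairs = sorted(zip(values, weights))          # sort by value
    half = sum(w for _, w in pairs) / 2
    cumulative = 0
    for i, (value, weight) in enumerate(pairs):
        cumulative += weight
        if cumulative > half:                     # crossed the halfway point
            return value
        if cumulative == half:                    # exact split: take the midpoint
            return (value + pairs[i + 1][0]) / 2

# Sensor readings and weights from the example above.
readings = [20, 25, 30]
reliability = [3, 1, 2]

print(weighted_median(readings, reliability))
```

For the three sensors the cumulative weight hits exactly half (3 of 6) at 20°C, so this convention returns the midpoint of 20 and 25.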
8. Outliers

• Definition: Data points that are very different from the rest.

• Can occur due to:

◦ Data entry or measurement errors.

◦ Genuine rare events (e.g., Bill Gates’ income).

• Impact: Strongly affects mean but not median.

• Outliers should be investigated and handled carefully.

Estimates of Variability

1. Meaning

• Variability (Dispersion) tells how spread out or consistent the data

values are.

• It measures how far values deviate from the center (mean or median).

• High variability → values differ widely.

• Low variability → values are close together.

Location = “where” data is centered
Variability = “how much” data varies around that center

2. Key Terms

Term                               Definition                                       Synonyms / Notes

Deviation                          Difference between a value and the mean          Error, Residual
Variance                           Average of squared deviations from the mean      Mean-squared error
Standard Deviation (SD)            Square root of variance                          Spread measure
Mean Absolute Deviation (MAD₁)     Average of absolute deviations                   L₁-norm, Manhattan norm
Median Absolute Deviation (MADₘ)   Median of absolute deviations from the median    Robust to outliers
Range                              Difference between largest and smallest value    Simple but sensitive
Interquartile Range (IQR)          Difference between 75th and 25th percentiles     Spread of middle 50%
Order Statistics                   Values after sorting data                        Ranks, Quantiles

3. Measuring Variability – Step-by-Step

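a) Mean Absolute Deviation (MAD₁)

\( \text{MAD}_1 = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \), where \(\bar{x}\) is the mean

Simple to compute
Not commonly used in theory (absolute values are hard to manipulate mathematically)

b) Variance (s²)

\( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \)

Measures how far data points deviate from the mean
Sensitive to outliers (since deviations are squared)

c) Standard Deviation (s)

\( s = \sqrt{s^2} \)

Easier to interpret (same unit as original data)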

Why Divide by (n–1)?

• Using n–1 gives an unbiased estimate of population variance.

• “n–1” is called degrees of freedom (because one constraint — the mean
— is already used).

• For large n, difference between n and n–1 is negligible.

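d) Median Absolute Deviation (MADₘ): Robust Measure

\( \text{MAD}_m = \operatorname{median}(|x_1 - m|, \ldots, |x_n - m|) \), where \(m\) is the median of the data

Resistant to outliers
Good for skewed or non-normal data
Sometimes multiplied by 1.4826 to make it comparable to SD for normal
distributions.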
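
The four measures above can be compared on one small data set. The sketch below uses the values {2, 6, 9, 15} purely as an illustration and computes each quantity directly from its definition; the variable names are not from the notes.

```python
import math
import statistics

data = [2, 6, 9, 15]
n = len(data)
mean = sum(data) / n                                        # 8.0

# a) Mean absolute deviation: average distance from the mean.
mean_abs_dev = sum(abs(x - mean) for x in data) / n         # 4.0

# b) Sample variance: squared deviations divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)     # 30.0

# c) Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

# d) Median absolute deviation: median distance from the median.
med = statistics.median(data)                               # 7.5
mad_median = statistics.median(abs(x - med) for x in data)  # 3.5

print(mean_abs_dev, variance, round(std_dev, 3), mad_median)
```

The standard deviation comes out larger than both absolute-deviation measures here because squaring gives the most distant values extra influence.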

e) Range

Range = Max - Min

Example:
Data = {2, 6, 9, 15}
Range = 15 – 2 = 13

Very simple
Highly sensitive to outliers (only uses two values)

f) Percentiles and Interquartile Range (IQR)

Percentile:
Value below which a certain % of data falls.

• 25th percentile = Q1
• 75th percentile = Q3

• 50th percentile = Median

IQR Formula:

IQR = Q3 - Q1

Example:
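Data (sorted) = {1, 2, 3, 3, 5, 6, 7, 9}

• Q1 = 2.5 (median of the lower half {1, 2, 3, 3})

• Q3 = 6.5 (median of the upper half {5, 6, 7, 9})

IQR = 6.5 - 2.5 = 4

Note: software packages may use other interpolation rules for quartiles and can report slightly different values for the same data.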

Measures spread of the middle 50% of data
Not affected by outliers
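
The quartile example can be reproduced with the Python standard library. This sketch computes Q1 and Q3 as the medians of the lower and upper halves (the rule used in the example), and then, for comparison, calls `statistics.quantiles` with `method='inclusive'`, which interpolates linearly and gives different quartiles for the same data.

```python
import statistics

data = sorted([1, 2, 3, 3, 5, 6, 7, 9])
half = len(data) // 2

# Median-of-halves rule (for odd n the middle value is left out of both halves).
q1 = statistics.median(data[:half])       # 2.5
q3 = statistics.median(data[-half:])      # 6.5
iqr = q3 - q1                             # 4.0

# Linear-interpolation rule from the standard library.
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")
iqr_inclusive = q_inclusive[2] - q_inclusive[0]

print(q1, q3, iqr)
print(q_inclusive, iqr_inclusive)
```

Both answers are legitimate; the point is to state which quartile rule was used when reporting an IQR.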
Exploring the Data Distribution
1. Meaning

• After finding measures of location and variability, we explore the overall

shape and pattern of the data.

• Helps to identify:

◦ Symmetry or skewness

◦ Outliers

◦ Tails (extreme values)

◦ Clusters or gaps

Visualization = most effective way to understand data distribution.

So far, we’ve looked at ways to describe data using a single number — like the
mean, median, or standard deviation — to show the center or spread of the data.
But it’s also important to look at how the data values are spread out overall,
that is, the shape or pattern of the entire distribution, not just one summary
number.

2. Key Terms

Term              Definition / Purpose

Boxplot           Visualizes distribution using quartiles (Tukey’s method)
Frequency Table   Groups data into intervals (bins) and counts occurrences
Histogram         Graph of frequency table; shows data spread visually
Density Plot      Smooth curve showing data distribution (probability density)

3. Percentiles and Boxplots

Percentiles

• A percentile indicates the value below which a given percentage of data lies.

◦ 25th percentile (Q1): 25% of data below

◦ 50th percentile: median

◦ 75th percentile (Q3): 75% of data below

Example (state murder rates): the median is 4 murders per 100,000 people, although there is quite a bit of
variability: the 5th percentile is only 1.6 and the 95th percentile is 6.51.

Boxplot

Boxplots, introduced by Tukey [Tukey-1977], are based on percentiles and give a quick way to visualize the distribution of data. Figure 1-2 shows a
boxplot of the population by state produced by R:
boxplot(state[['Population']]/1000000, ylab='Population (millions)')
pandas provides a number of basic exploratory plots for data frames; one of them is the boxplot:
ax = (state['Population']/1_000_000).plot.box()
ax.set_ylabel('Population (millions)')
A boxplot helps us quickly understand how the data is spread.

• The middle line inside the box shows the median (middle value).
In this example, the median state population is about 5 million.
• The box shows where the middle half of the states fall:
between 2 million and 7 million people.
• The top of the box is the 75th percentile (upper-middle value).
• The bottom of the box is the 25th percentile (lower-middle value).
• The whiskers (the dotted or straight lines extending from the box) show the range where
most of the states fall.
• Any points outside the whiskers are outliers: states with very high or very low
populations compared to others.
4. Frequency Table

• Divides the data range into equal-width bins and counts observations per bin.

• Shows how data is distributed across intervals.

Example (Population by State):

Simple Explanation of the Histogram Bins

• Wyoming has the smallest population (about 5.6 lakh)

• California has the largest population (about 3.7 crore)

So the range of populations is:

37,253,956 – 563,626 = 36,690,330

To create a histogram, we divide this big range into equal parts called bins.

• If we choose 10 bins, each bin will cover about 3.67 million people (36,690,330 ÷ 10).
• The first bin covers populations from 563,626 to 4,232,658.
• The top bin covers 33,584,923 to 37,253,956, and it contains only California, because
no other state has such a high population.
• The bins just below California are empty (no states fall in that population range).
These empty bins are important because they tell us there are no states in that range.
• Choosing bin size matters:
◦ If bins are too big, you hide important patterns.
◦ If bins are too small, the graph becomes confusing and noisy.
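
The bin arithmetic above can be checked with a short script. This is a sketch that uses only the two populations quoted in the text (Wyoming and California); the helper `bin_index` is a hypothetical name, and a real frequency table would of course count all states.

```python
lo, hi = 563_626, 37_253_956        # smallest and largest state population
n_bins = 10
width = (hi - lo) / n_bins          # 3,669,033 people per bin

# Bin edges: n_bins + 1 equally spaced cut points from lo to hi.
edges = [lo + i * width for i in range(n_bins + 1)]

def bin_index(value):
    """Index of the equal-width bin containing value (top edge goes in the last bin)."""
    return min(int((value - lo) // width), n_bins - 1)

# Only the two populations mentioned in the text are used here.
populations = {"Wyoming": 563_626, "California": 37_253_956}
counts = [0] * n_bins
for pop in populations.values():
    counts[bin_index(pop)] += 1

print(width)
print(counts)
```

Wyoming lands in the first bin and California in the last, with every bin in between left empty for these two values.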

5. Histogram

• A bar-like plot that represents frequencies of bins.

• X-axis: variable values (divided into bins)

• Y-axis: frequency (count or proportion)

Features:

• Bars are contiguous (touch each other)

• Empty bins show gaps (no data there)

• Number of bins affects clarity

◦ Too few → oversimplified

◦ Too many → noisy

Helps identify skewness and outliers.

6. Density Plot

• A smooth version of a histogram showing continuous data distribution.

• Uses Kernel Density Estimation (KDE) to draw the curve.

• Area under the curve = 1 (represents 100% of the data).

Example (Murder Rate Density Plot):

In R
hist(state[['Murder.Rate']], freq=FALSE)
lines(density(state[['Murder.Rate']]), lwd=3, col='blue')

In Python
ax = state['Murder.Rate'].plot.hist(density=True, xlim=[0,12], bins=range(1,12))
state['Murder.Rate'].plot.density(ax=ax)
ax.set_xlabel('Murder Rate (per 100,000)')

Shows peaks, tails, and skewness more smoothly than histogram.

Controlled by bandwidth (bw) — small bw = more detail, large bw = smoother.
7. Comparison of Graphs

Graph Type        Data Type             Shows                         Best For

Boxplot           Numeric               Median, Quartiles, Outliers   Quick summary comparison
Histogram         Numeric               Frequency of bins             Shape of distribution
Density Plot      Numeric               Smooth distribution curve     Continuous data
Frequency Table   Numeric/Categorical   Tabular count                 Precise frequency info

Exploring Binary and Categorical Data

1. What is Categorical Data?

Categorical data means data that represents groups or labels, not numbers.

Examples:

• Gender (Male/Female)
• Type of delay (Weather, Security, ATC, Inbound)
• Blood group (A, B, AB, O)

Binary data is a special type of categorical data with only two values: Yes/No, 0/1, True/False.

2. How to Summarize Categorical or Binary Data?

✔ Just use proportions or percentages

Example:
At Dallas/Fort Worth Airport (DFW), delays since 2010 are caused by:

Cause % of Delays
Carrier 23.02%
ATC 30.40%
Weather 4.03%
Security 0.12%
Inbound aircraft 42.43%

This tells us the largest cause of delays is Inbound aircraft (42.43%).

3. Bar Charts (Best for Categorical Data)
✔ Why bar charts?

• Show how many times each category occurs
• Easy to compare categories
• X-axis = category names
• Y-axis = count or percentage
• Bars do NOT touch each other

✔R code
barplot(as.matrix(dfw) / 6, cex.axis=0.8, cex.names=0.7,
xlab='Cause of delay', ylab='Count')

✔Python Pandas code

ax = dfw.transpose().plot.bar(figsize=(4, 4), legend=False)
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')

4. Bar Chart vs Histogram

Feature Bar Chart Histogram
X-axis shows Categories Numeric ranges (bins)
Bars Separated Touch each other
Used for Categorical data Numerical data

5. Pie Charts

Pie charts show proportions in a circle.

But statisticians avoid them because:

• Hard to compare slice sizes
• Not very accurate visually

Bar charts are a better choice.

6. Mode (Most Frequent Category)
✔ Definition

The mode is the category that appears most often.

Examples:

• At DFW Airport, the mode of delay causes = Inbound aircraft (highest percentage).
• In the U.S., the mode for religion might be Christian.

Mode is mainly used for categorical data, not numeric data.

7. Expected Value (Weighted Average for Categorical Choices)

Sometimes categorical choices have numeric values attached.

Example:

A cloud service has:

• 5% of customers choosing the ₹300 plan
• 15% choosing the ₹50 plan
• 80% choosing nothing (₹0)

Expected value = (0.05 × 300) + (0.15 × 50) + (0.80 × 0) = 15 + 7.5 + 0 = ₹22.5 per customer

8. Why Expected Value is Important?

Expected value is used in:

• Business decision making
• Profit estimation
• Risk analysis
• Capital budgeting
• Predicting future gains or losses

Examples:

• Expected five-year profit of a new startup
• Expected cost savings of a new hospital software
• Expected returns from an investment

Probability
✔ What is Probability?

Probability tells us how likely something is to happen.

You see probability in everyday life:

• Weather forecast → “40% chance of rain”
• Sports → “Team has 70% chance of winning”

✔ Probability and Odds

Sometimes probability is expressed as odds, especially in sports and games.

Example:

Odds of winning = 2 to 1 (two wins expected for every one loss)

So, probability of winning = 2 / (2 + 1) = 2/3 ≈ 66.7%

✔ Formal meaning

Formally, probability means:

If you repeat an event many, many times, the probability is the proportion of times it would
happen.

Example:

If you flip a coin thousands of times:

• Heads will come up about 50% of the time
• Tails will come up about 50% of the time

Even though each individual flip is uncertain, over many flips, the proportion settles near 0.5.

This idea is enough for most real-world use.

Definition:
Probability = Number of times an event happens ÷ Number of times you try the experiment (if
repeated many times).
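
The long-run-proportion idea can be illustrated with a simulated fair coin. This is only a sketch: the seed and the number of flips are arbitrary choices made so the run is repeatable, and a pseudo-random generator stands in for a real coin.

```python
import random

random.seed(42)                 # arbitrary seed, only for repeatability
n_flips = 100_000

# Count a flip as heads when the random draw falls below 0.5.
heads = sum(random.random() < 0.5 for _ in range(n_flips))
proportion_heads = heads / n_flips

print(proportion_heads)
```

Any single flip is unpredictable, but with this many flips the printed proportion should sit very close to 0.5.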

Correlation
1. What is Correlation?

Correlation tells us how strongly two variables move together.

• Positive correlation:
When X increases, Y also increases.
Example → Hours studied ↑, marks ↑
• Negative correlation:
When X increases, Y decreases.
Example → Speed ↑, travel time ↓
• Zero correlation:
No linear relationship.

2. Example of Perfect Positive Correlation

v1 = {1, 2, 3}
v2 = {4, 5, 6}

Both increase together → perfect correlation.

3. Why Simple “Sum of Products” is Not Useful

Multiply each pair:

1×4 + 2×5 + 3×6 = 32

If you shuffle values in v2, this sum will never be more than 32.
But this number alone is not meaningful.

We need a measure that:

✔ Is standardized
✔ Lies between –1 and +1
✔ Works for all scales

So we use Pearson’s correlation coefficient.

4. Pearson’s Correlation Coefficient

To compute Pearson’s correlation coefficient, we multiply deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by (n–1) times the product of the standard deviations:

\( r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} \)

The result is always between –1 and +1:

• +1 → perfect positive
• 0 → no linear relationship
• –1 → perfect negative

We divide by n–1 (not n) because s_x and s_y are sample standard deviations, which adjust for degrees
of freedom.
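
Pearson’s coefficient can be computed directly from this definition. The sketch below applies it to v1 = {1, 2, 3} and v2 = {4, 5, 6} from the earlier example; the function name `pearson_r` is illustrative, and in practice a library routine would normally be used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation using sample standard deviations (n - 1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the means.
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return sum_products / ((n - 1) * s_x * s_y)

v1 = [1, 2, 3]
v2 = [4, 5, 6]

raw_sum_of_products = sum(a * b for a, b in zip(v1, v2))   # 32, scale-dependent
r_same_order = pearson_r(v1, v2)                           # perfect positive
r_reversed = pearson_r(v1, [6, 5, 4])                      # perfect negative

print(raw_sum_of_products, r_same_order, r_reversed)
```

The raw sum of products (32) changes with the units of the data, while r stays on the fixed –1 to +1 scale.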

5. Correlation is Only for Linear Relationships

If the relationship is curved or non-linear, correlation does not work well.
Example:
Tax rate vs revenue:

• At first, tax increases → revenue increases
• After a point, tax increases → people avoid taxes → revenue decreases

This is not a straight-line relationship → correlation is misleading.

6. Correlation Matrix
A correlation matrix shows the correlation between each pair of variables.

Example:
Stocks A, B, C
You find correlation between:

• A and A
• A and B
• A and C
• B and B
• B and C
• C and C

The diagonal always has 1 (correlation of a variable with itself).

You will also see repeated values above and below the diagonal → matrix is symmetrical.

SIMPLE Example of Correlation Matrix

Suppose you have data

Day Stock A Stock B Stock C

1 10 20 30
2 12 22 28
3 14 24 27

Step 1: Calculate correlation between every pair:

• Corr(A, B)
• Corr(A, C)
• Corr(B, C)

Use Pearson formula (or software).

Final matrix:
     A       B       C
A    1.00    1.00   –0.98
B    1.00    1.00   –0.98
C   –0.98   –0.98    1.00

Values are rounded to two decimals. Corr(A, B) is exactly 1 because B = A + 10 on every day, and C falls as A and B rise.

How to Compute a Correlation Matrix in Python
import pandas as pd

data = {
'A': [10, 12, 14],
'B': [20, 22, 24],
'C': [30, 28, 27]
}

df = pd.DataFrame(data)

corr_matrix = df.corr()
print(corr_matrix)

How to Compute a Correlation Matrix in R

data <- data.frame(
A = c(10, 12, 14),
B = c(20, 22, 24),
C = c(30, 28, 27)
)

cor_matrix <- cor(data)

print(cor_matrix)

Correlation and Outliers

✔Correlation is sensitive to outliers

Just like mean and standard deviation, the correlation coefficient can be badly affected by outliers.

Example:
If most values follow a pattern, but one value is extremely large or extremely small, that single point
can pull the correlation up or down and give a misleading result.
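
The effect of a single extreme point can be demonstrated numerically. The data below is hypothetical (a near-perfect linear pattern plus one deliberately extreme point), and `pearson_r` is an illustrative helper in which the n–1 factors cancel out of the formula.

```python
import math

def pearson_r(x, y):
    """Pearson correlation (the n - 1 terms cancel, so they are omitted)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data following a clear upward pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2]
r_clean = pearson_r(x, y)

# Same data with one extreme point appended.
r_with_outlier = pearson_r(x + [9], y + [-20])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

Eight of the nine points still follow the same upward pattern, yet the one added point is enough to drag the coefficient far away from +1.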

So:

• One extreme value → correlation changes a lot
• This means correlation is not robust

Scatterplots
1. What is a Scatterplot?

A scatterplot is a graph that shows the relationship between two numerical variables.

• X-axis → one variable
• Y-axis → another variable
• Each dot on the plot represents one record/data point

Example:
Comparing daily returns of two stocks (ATT and Verizon).
2. Why do we use Scatterplots?
Scatterplots help us visually see:

• If two variables increase together (positive correlation)
• If one increases while the other decreases (negative correlation)
• If there is no pattern (no correlation)
• If the relationship is linear or non-linear

Example of Scatterplot
Example: Relationship Between Study Hours and Exam Marks

Suppose we collect data from 8 students about how many hours they studied and the marks they
scored:

Student Study Hours (X) Marks (Y)

1 2 50
2 3 55
3 4 60
4 5 65
5 6 70
6 7 72
7 8 78
8 9 85

If we plot Study Hours on the X-axis and Marks on the Y-axis, each student becomes a point on the
graph.

Interpretation:

• As study hours increase, marks also increase.
• Most points fall along an upward-sloping pattern.
• This shows a positive correlation between study hours and marks.
• There are no cases where a student studied more but scored much lower, so the relationship is
strong.

Conclusion:

A scatterplot of study hours vs. marks indicates a strong positive linear relationship. Students who
study more tend to score higher.

Same Example in R
hours <- c(2,3,4,5,6,7,8,9)
marks <- c(50,55,60,65,70,72,78,85)

plot(hours, marks,
xlab="Study Hours",
ylab="Marks",
main="Scatterplot of Study Hours vs Marks")

Same Example in Python

import matplotlib.pyplot as plt

hours = [2,3,4,5,6,7,8,9]
marks = [50,55,60,65,70,72,78,85]

plt.scatter(hours, marks)
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Scatterplot of Study Hours vs Marks")
plt.show()

How to Improve Dense Scatterplots

When a scatterplot has too many overlapping points, these techniques can help:

• Transparency (make points slightly see-through)
• Hexagonal binning (group points into hexagons)
• Density plots (show concentration of points)

These techniques help reveal hidden patterns.

Exploring Two or More Variables

Exploratory Data Analysis (EDA) becomes more meaningful when we study the relationship between
two or more variables. Depending on whether the variables are numeric or categorical, different
techniques are used. When both variables are numeric, visual tools such as scatterplots, hexagonal
binning, and contour plots help us identify patterns. When both variables are categorical, contingency
tables summarize their relationship. When one variable is numeric and the other is categorical,
boxplots and violin plots help us compare distributions across groups. For more than two variables,
faceting allows us to create multiple small plots to compare patterns across subgroups.

1. Hexagonal Binning (Numeric vs Numeric)

When a dataset contains thousands or lakhs of points, a normal scatterplot becomes unreadable.
Hexagonal binning solves this by dividing the plot area into hexagons and counting how many points
fall into each hexagon. Darker hexagons indicate higher data density.

✔ Simple Example

Suppose we want to explore the relationship between

• Finished square feet of homes, and
• Tax-assessed value

In a large housing dataset (over 4 lakh observations), a normal scatterplot becomes a solid black cloud.
Hexagonal binning clearly shows:

• A strong positive relationship between square feet and value
• Additional “bands” representing houses valued higher than normal (premium locations)

✔ R Code
ggplot(kc_tax0, aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex(color='white') +
theme_bw() +
scale_fill_gradient(low='white', high='black') +
labs(x='Finished Square Feet', y='Tax-Assessed Value')

✔ Python Code
ax = kc_tax0.plot.hexbin(
x='SqFtTotLiving', y='TaxAssessedValue',
gridsize=30, figsize=(5,4)
)
ax.set_xlabel("Finished Square Feet")
ax.set_ylabel("Tax-Assessed Value")

2. Contour Plots (Density Plots)

Contour plots show the density of two numeric variables. They act like a topographical map: each
contour band represents a specific density of points, and the density increases toward a peak.

✔ Example

Using the same housing dataset, a contour plot shows:

• A main cluster where most homes fall
• A secondary cluster representing high-value areas
• Smooth curves showing density instead of individual points

✔ R Code
ggplot(kc_tax0, aes(SqFtTotLiving, TaxAssessedValue)) +
theme_bw() +
geom_point(alpha=0.1) +
geom_density2d(color='white') +
labs(x='Finished Square Feet', y='Tax-Assessed Value')

✔ Python Code
import seaborn as sns
sns.kdeplot(data=kc_tax0, x='SqFtTotLiving', y='TaxAssessedValue')

3. Contingency Tables (Categorical vs Categorical)

A contingency table shows the counts of two categorical variables.

✔ Example

In the Lending Club dataset, each loan has:

• A grade (A to G)
• A status (Fully Paid, Current, Late, Charged Off)

A contingency table shows:

• High-grade loans (A, B) → fewer defaults
• Low-grade loans (F, G) → many defaults

This helps identify risk levels.
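
A contingency table of this kind can be sketched with `pandas.crosstab`. The eight loans below are invented purely to show the mechanics; they are not taken from the Lending Club data, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical loans, invented only to demonstrate the table layout.
loans = pd.DataFrame({
    "grade":  ["A", "A", "A", "B", "B", "F", "F", "G"],
    "status": ["Fully Paid", "Fully Paid", "Current", "Fully Paid",
               "Charged Off", "Charged Off", "Fully Paid", "Charged Off"],
})

# Rows = grade, columns = status, cells = counts; margins adds the totals.
table = pd.crosstab(loans["grade"], loans["status"], margins=True)

print(table)
```

Each cell counts the loans with that grade and status, and the `All` row and column hold the totals.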

✔ R Code
library(descr)
CrossTable(lc_loans$grade, lc_loans$status,
prop.c=FALSE, prop.chisq=FALSE, prop.t=FALSE)

✔ Python Code
crosstab = lc_loans.pivot_table(
index='grade', columns='status',
aggfunc=lambda x: len(x), margins=True
)

4. Boxplots (Numeric vs Categorical)

A boxplot compares a numeric variable across categories. It shows:

• Median
• Quartiles
• Overall spread

✔ Example

Consider the percentage of airline delays for different airlines. A boxplot shows:

• Alaska Airlines has the lowest delay percentage
• American Airlines has the highest
• Variability differs between airlines

✔ R Code
boxplot(pct_carrier_delay ~ airline, data=airline_stats,
ylim=c(0, 50))

✔ Python Code
ax = airline_stats.boxplot(by='airline', column='pct_carrier_delay')

5. Violin Plots
Boxplots show summary statistics, but violin plots show the full density distribution. A violin plot is
useful when the distribution is skewed or multi-peaked.

✔ Example

In the airline delay dataset:

• The violin plot shows that Alaska has a high concentration of days with very low delays
• This density pattern is not clear in the boxplot

✔ R Code
ggplot(airline_stats, aes(airline, pct_carrier_delay)) +
geom_violin() + ylim(0, 50)

✔ Python Code
import seaborn as sns
sns.violinplot(data=airline_stats, x='airline', y='pct_carrier_delay')

6. Faceting (Comparing Multiple Subgroups)
Faceting creates multiple small plots based on a categorical variable.

✔ Example

In the King County housing dataset, we can compare square footage vs. tax value for different zip
codes. Faceted hexbin plots show:

• Some zip codes have much higher home values
• Other zip codes have lower values even for similar-sized houses

This reveals geographic pricing patterns.

✔ R Code
ggplot(subset(kc_tax0, ZipCode %in% c(98188,98105,98108,98126)),
aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex() +
facet_wrap('ZipCode')

✔ Python Code
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(kc_tax_zip, col='ZipCode', col_wrap=2)
g.map(plt.hexbin, 'SqFtTotLiving', 'TaxAssessedValue')

Examples for Exploring Binary and Categorical Data

Example 1: Binary Variable — Yes/No Survey

Suppose a survey of 200 students asks whether they use public transport (Yes or No).

• 120 say “Yes”, 80 say “No”.

Here, the binary variable is “Public Transport Use” (Yes / No).

• Proportion using public transport = 120/200 = 0.60 → 60%
• Proportion not using public transport = 80/200 = 0.40 → 40%

This tells us at a glance: 60% of students use public transport.

This is an example of summarizing a binary variable using proportion/percentage.
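The proportion calculation above can be sketched in a few lines of Python. This is only an illustration using the standard library; the `responses` list is a hypothetical reconstruction of the survey from the counts given in the example.

```python
from collections import Counter

# Hypothetical raw responses, rebuilt from the counts in the example
# (120 "Yes" and 80 "No" out of 200 students).
responses = ["Yes"] * 120 + ["No"] * 80

# Count each category, then divide by the total to get proportions.
counts = Counter(responses)
total = sum(counts.values())
proportions = {answer: n / total for answer, n in counts.items()}

for answer, p in proportions.items():
    print(f"{answer}: {counts[answer]} of {total} = {p:.0%}")
```

With a pandas Series, `value_counts(normalize=True)` would give the same proportions in one call.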

Example 2: Categorical Variable with Multiple Categories + Bar Chart

Suppose you record the favourite fruit of 150 people. The options are Apple, Banana, Mango, Orange,
Others. The results:

Fruit     Count
Apple     50
Banana    30
Mango     45
Orange    20
Others    5

You can summarise this categorical data by converting counts to percentages, e.g.:

• Apple: 33.3%
• Banana: 20%
• Mango: 30%
• Orange: 13.3%
• Others: 3.3%

To visualise, a bar chart works well: x-axis = Fruit names; y-axis = percentage or count.

This example shows how you explore a categorical variable with more than two categories.

Mode: Here, Apple is the mode (most frequent fruit).
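As a rough sketch, the percentages and the mode for the fruit example can be computed as follows. The dictionary name `fruit_counts` is an assumption made for illustration; the counts themselves come from the table above.

```python
# Counts taken from the favourite-fruit table (150 people in total).
fruit_counts = {"Apple": 50, "Banana": 30, "Mango": 45, "Orange": 20, "Others": 5}

total = sum(fruit_counts.values())

# Convert each count to a percentage of the total, rounded to one decimal place.
percentages = {fruit: round(100 * n / total, 1) for fruit, n in fruit_counts.items()}

# The mode is the category with the largest count.
mode = max(fruit_counts, key=fruit_counts.get)

for fruit, pct in percentages.items():
    print(f"{fruit}: {pct}%")
print("Mode:", mode)
```

The same `percentages` dictionary could be passed to a plotting library to draw the bar chart described above.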

Example 3: Categorical Data Representing Numeric Values → Expected Value

Suppose a software company offers subscription plans:

• Premium Plan: ₹500/month, chosen by 10% of customers
• Standard Plan: ₹200/month, chosen by 30% of customers
• No Plan: ₹0, chosen by 60% of customers

We want to find the expected monthly revenue per customer.

Compute:

Expected value = (0.10 × ₹500) + (0.30 × ₹200) + (0.60 × ₹0) = ₹50 + ₹60 + ₹0 = ₹110

So expected revenue per customer = ₹110/month.

This shows how categorical choices mapped to numeric values can be summarised via expected value.
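The expected value is a probability-weighted sum, which can be checked with a short sketch. The `plans` list below is a hypothetical way of holding the prices and shares from the example.

```python
# (plan name, monthly price in rupees, share of customers) from the example.
plans = [
    ("Premium", 500, 0.10),
    ("Standard", 200, 0.30),
    ("No Plan", 0, 0.60),
]

# The shares must add up to 1 for the weighted sum to be an expected value.
total_share = sum(share for _, _, share in plans)
assert abs(total_share - 1.0) < 1e-9

# Expected revenue = sum of (price x probability) over all plans.
expected_revenue = sum(price * share for _, price, share in plans)

print(f"Expected revenue per customer: {expected_revenue:.0f} rupees/month")
```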
Examples for Exploring Two or More Variables
Here we deal with relationships between variables — numeric–numeric, categorical–categorical,
numeric–categorical, or multivariate.

Example 1: Numeric vs Numeric → Scatterplot / Hexagonal Binning / Contour

Scenario: You have data on 500 houses with their “Area (in sq ft)” and “Market Price (₹)”.

If you plot a scatterplot (Area on the x-axis, Price on the y-axis), you might see a cloud of points. It may
show a general upward trend (bigger area → higher price), but many points may overlap.

If the dataset had 50,000 houses, a scatterplot would be too dense to read. Instead, use hexagonal
binning, which shades hexagons that contain more points darker. You may observe a clear positive
relationship: as area increases, price increases. You might also see a “dense band” of standard-priced
houses and a second band of premium-priced houses (e.g. sea-facing or luxury homes), which a regular
scatterplot might hide.

Alternatively, a contour plot shows density contours: the densest region (mode) where most
houses lie, and lighter regions for expensive or small homes. This makes patterns visible even with many
data points.
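The idea behind binning can be illustrated without a plotting library. The sketch below uses a rectangular grid rather than hexagons (a simplification, not the hexbin algorithm itself) and randomly generated data. The price model, bin widths, and variable names are all assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 50,000 houses, price roughly proportional to area plus noise.
areas = [random.uniform(500, 3500) for _ in range(50_000)]
prices = [a * 5_000 + random.gauss(0, 500_000) for a in areas]

AREA_BIN = 500          # width of each area bin in sq ft (assumed)
PRICE_BIN = 5_000_000   # width of each price bin in rupees (assumed)

# Assign every house to a grid cell and count houses per cell.
counts = Counter(
    (int(a // AREA_BIN), int(p // PRICE_BIN)) for a, p in zip(areas, prices)
)

# 50,000 overlapping points collapse into a small number of cell counts.
print("Number of occupied cells:", len(counts))
for (area_bin, price_bin), n in counts.most_common(5):
    print(f"area bin {area_bin}, price bin {price_bin}: {n} houses")
```

A hexbin plot applies the same counting step to hexagonal cells and then maps each count to a colour.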

Example 2: Categorical vs Categorical → Contingency Table

Scenario: A university wants to study the relationship between stream of study (Science, Commerce,
Arts) and whether a student is placed after an internship (Placed, Not Placed). They survey 300 students:

Stream      Placed   Not Placed   Total
Science     80       20           100
Commerce    60       40           100
Arts        30       70           100

From this contingency table, one can observe:

• Placement rate for Science: 80/100 = 80%
• Placement rate for Commerce: 60/100 = 60%
• Placement rate for Arts: 30/100 = 30%

This shows that students from the Science stream have a higher placement rate than those from Commerce or
Arts. This is how categorical–categorical relationships are explored.
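The row percentages of a contingency table can be computed as in the sketch below. The nested dictionary `table` is a hypothetical representation of the counts from the example; with pandas, `pd.crosstab` with `normalize='index'` serves the same purpose.

```python
# Contingency table from the example: stream -> outcome -> count.
table = {
    "Science": {"Placed": 80, "Not Placed": 20},
    "Commerce": {"Placed": 60, "Not Placed": 40},
    "Arts": {"Placed": 30, "Not Placed": 70},
}

# Placement rate = placed students divided by the row total for that stream.
placement_rate = {
    stream: row["Placed"] / sum(row.values()) for stream, row in table.items()
}

for stream, rate in placement_rate.items():
    print(f"{stream}: {rate:.0%} placed")
```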

Example 3: Numeric vs Categorical → Boxplot / Violin Plot

Scenario: A company records the working hours per day of employees and their department: HR,
Engineering, Sales. They have data for 300 employees.

To compare the distribution of working hours across departments:

• Draw a boxplot: x-axis = Department, y-axis = Hours worked
◦ You may see that Engineering has a higher median and a wider spread, while Sales
has a lower median but some outliers (people working long hours).
• To see the full distribution (density), draw a violin plot.
◦ The violin plot might reveal that HR hours are tightly clustered around 8 hours,
Engineering shows wide variation (some work 6–10 hours, some 12–14 hours), and
Sales shows two peaks (morning shift and evening shift).

This helps compare groups in ways that the mean and standard deviation alone cannot show.
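The numbers a boxplot draws (median and quartiles) can be computed per group, as in the sketch below. The `hours` values are made up for illustration and are not from any real dataset. The quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from what a plotting library uses.

```python
from statistics import median, quantiles

# Hypothetical daily working hours for a few employees per department.
hours = {
    "HR": [7.5, 8, 8, 8, 8.5],
    "Engineering": [6, 8, 10, 12, 14],
    "Sales": [5, 6, 6, 7, 12],
}

summary = {}
for dept, values in hours.items():
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    summary[dept] = {
        "median": median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,  # interquartile range = height of the box
    }

for dept, stats in summary.items():
    print(dept, stats)
```

In this made-up data, Engineering has the highest median and the widest box, and the single 12-hour value in Sales is the kind of point a boxplot would draw as an outlier.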

Example 4: Multivariate / Faceting — Comparing Subgroups

Scenario: A real-estate dataset contains house size (sq ft), price (₹), and city zone (Zone A, Zone B,
Zone C).

If you plot size vs. price for all zones together, you may see a general positive relationship, but
zone-wise differences may be hidden.

By using faceting, you create a separate plot for each zone:

• Zone A: houses show very high prices for a given size (premium area)
• Zone B: moderate prices
• Zone C: lower prices

Faceting reveals that location, in addition to size, strongly influences price. Thus, multivariate analysis
helps uncover deeper patterns.
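Faceting first splits the data by a categorical variable, then plots each subset separately. The sketch below shows only the splitting step and summarises each subset with a single number instead of a plot. The `houses` records are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (zone, size in sq ft, price in rupees).
houses = [
    ("Zone A", 1000, 12_000_000), ("Zone A", 1500, 18_500_000),
    ("Zone B", 1000, 7_000_000), ("Zone B", 1500, 10_000_000),
    ("Zone C", 1000, 4_000_000), ("Zone C", 1500, 6_500_000),
]

# Split the data by zone, as a faceted plot would before drawing each panel.
by_zone = defaultdict(list)
for zone, size, price in houses:
    by_zone[zone].append(price / size)  # price per sq ft

# Summarise each subgroup separately.
price_per_sqft = {zone: mean(values) for zone, values in by_zone.items()}

for zone, value in price_per_sqft.items():
    print(f"{zone}: about {value:,.0f} rupees per sq ft")
```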
