Module 1 (Exploratory Data Analysis)
Introduction to Exploratory Data Analysis (EDA)
Meaning:
• Exploratory Data Analysis (EDA) is the first and most important step in
any data science project.
• It involves summarizing, visualizing, and understanding the main
features of data before applying formal statistical or machine learning
models.
Background and Origin
• Classical statistics mainly focused on statistical inference — i.e.,
drawing conclusions about large populations from small samples
using complex mathematical procedures.
• In 1962, John W. Tukey proposed a new scientific discipline called
“Data Analysis” in his paper “The Future of Data Analysis.”
• Tukey’s idea was to include data exploration as a key component of
statistics, not just inference.
• He connected statistics with engineering and computer science, even
coining the terms:
◦ Bit for binary digit
◦ Software
Tukey’s Contribution
• In 1977, Tukey published the classic book “Exploratory Data Analysis.”
• He introduced simple yet powerful techniques to understand data:
◦ Boxplots
◦ Scatterplots
◦ Summary statistics like mean, median, and quantiles
• His approach emphasized visual understanding of data rather than
relying only on formulas.
Modern Development of EDA
• With advanced computing power and data analysis software (R,
Python, etc.), EDA has evolved beyond its original scope.
• Key drivers of modern EDA:
1. Rapid technology development
2. Availability of large and complex datasets
3. Increased use of data-driven decisions in business,
research, and engineering
David Donoho’s View
• David Donoho, a professor at Stanford and former student of
Tukey, emphasized Tukey’s lasting impact.
• In his 2015 paper, Donoho showed that modern data science traces its
origins to Tukey’s pioneering work in data analysis.
Elements of Structured Data
1. Meaning of Data
• Data comes from various sources: sensors, text, images, videos,
events, and more.
• The Internet of Things (IoT) generates continuous data streams.
• Most raw data is unstructured (e.g., images = pixels, text = words).
• For statistical analysis, unstructured data must be converted into a
structured format.
2. Structured Data
• Structured data is organized in rows and columns (like a
table or spreadsheet).
• This structure allows easy storage, retrieval, and analysis.
• Common examples: data from relational databases or CSV files.
3. Types of Structured Data
A. Numeric Data
• Represent quantities or numbers.
• Two forms:
◦ Continuous: can take any value within an interval
Example: temperature, wind speed, time duration.
◦ Discrete: take only integer or count values
Example: number of students, number of calls received.
B. Categorical Data
• Represent fixed sets of possible categories.
• Examples: gender, product type, state name.
• Subtypes:
◦ Binary Data: only two possible values (e.g., 0/1, Yes/No, True/False).
◦ Ordinal Data: categories have a meaningful order
Example: rating scale (poor, good, excellent) or numeric ranks (1, 2,
3, 4, 5).
4. Importance of Data Types
• Helps decide:
◦ Which visualization to use (e.g., histogram, bar chart).
◦ Which statistical model or algorithm to apply.
• Data science tools like R and Python use data types to:
◦ Improve computational efficiency.
◦ Control how data is handled in analysis.
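5. Handling Data in Software
In R:
• In R versions before 4.0, read.csv() automatically converted text columns into factors (categorical
variables); since R 4.0 this happens only when stringsAsFactors=TRUE is set.
• Assigning a value outside a factor's predefined categories produces a warning and an NA
(missing value).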
In Python (pandas):
• Text columns remain as text by default.
• You can explicitly define columns as categorical using:
pd.read_csv('[Link]', dtype={'column_name': 'category'})
6. Key Advantages of Identifying Data Types
• Tells the software how to process and visualize data.
• Improves storage efficiency and indexing (as in databases).
• Ensures only valid values are entered (like enums).
7. Key Terms Summary
Term | Definition / Example
Numeric Data | Numbers (continuous or discrete)
Continuous | Any value in a range (e.g., time, weight)
Discrete | Counted numbers (e.g., number of students)
Categorical | Fixed set of values (e.g., colors, countries)
Binary | Two possible values (Yes/No, 0/1)
Ordinal | Ordered categories (Low, Medium, High)
Rectangular Data
1. Meaning
• Rectangular data refers to a two-dimensional table (rows × columns).
• It is the standard format used in data analysis and machine learning.
• Example: Excel sheet, CSV file, or database table.
2. Structure of Rectangular Data
• Rows → Records (observations, samples, instances)
• Columns → Variables or Features (attributes, predictors)
• Each cell contains a value for a specific record and variable.
In R and Python, this structure is called a Data Frame.
3. Common Terms
Term | Meaning / Synonyms
Data Frame | Rectangular data structure used in R and Python
Feature | A column in the data (also called variable, predictor, or input)
Outcome / Target | The variable to be predicted (also called dependent variable, response, or output)
Record | A row in the data (also called case, instance, or observation)
4. Example of a Data Frame
Category | Currency | Seller Rating | Duration | EndDay | ClosePrice | OpenPrice | Competitive?
Music | US | 3249 | 5 | Mon | 0.01 | 0.01 | 0
Automotive | US | 3115 | 7 | Tue | 0.01 | 0.01 | 1
Explanation:
• The table includes both numeric data (e.g., price, duration) and
categorical data (e.g., category, currency).
• The “Competitive?” column is a binary variable (0 = No, 1 =
Yes) representing the outcome.
5. Data Frames and Indexes
In Python (pandas):
• Data is stored as a DataFrame object.
• Each row has an automatic index (0, 1, 2, ...).
• You can create multi-level indexes to speed up operations.
In R:
• Data is stored as a data.frame.
• Rows are indexed by position.
• Packages like data.table and dplyr provide faster data manipulation and
support multi-indexing.
6. Terminology Differences
• Statisticians say:
◦ Predictor variables predict a response variable.
• Data scientists say:
◦ Features predict a target.
• Note:
◦ “Sample” means one record to a computer scientist, but a
collection of records to a statistician.
7. Non-Rectangular Data Structures
Not all data is rectangular. Some important types are:
a. Time Series Data
• Records measurements of a variable over time.
• Used in forecasting, finance, IoT devices, etc.
• Example: daily temperature readings.
b. Spatial Data
• Related to geographical or spatial locations.
• Example: house price based on location coordinates.
• Represented as either:
◦ Object view: each item has coordinates (e.g., a house on a map).
◦ Field view: data for every small region (e.g., pixel brightness).
c. Graph or Network Data
• Represents connections or relationships between entities.
• Examples:
◦ Social network (people connected by friendships).
◦ Transportation network (cities connected by roads).
• Used in network analysis and recommendation systems.
Estimates of Location
1. Meaning
• Location (Central Tendency) shows where the center or “typical” value
of a dataset lies.
• It gives an idea of what is normal or expected for a variable.
• Common measures: Mean, Median, Weighted Mean, Trimmed Mean.
2. Key Terms
Term | Meaning / Definition | Synonyms / Notes
Mean | Average value (sum of all values ÷ number of values) | Average
Weighted Mean | Average giving importance (weight) to each value | Weighted Average
Median | Middle value when data is sorted | 50th Percentile
Percentile | Value below which a certain % of data lies | Quantile
Trimmed Mean | Mean after removing extreme high & low values | Truncated Mean
Robust | Not affected much by outliers | Resistant
Outlier | Unusually high or low value compared to others | Extreme Value
3. Mean (Arithmetic Mean)
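\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Where
• (x_i) = individual values
• (n) = number of observations
Example:
Data = {3, 5, 1, 2}
Mean = (3 + 5 + 1 + 2) / 4 = 2.75
Works best when the data is roughly symmetric with no extreme values, because a single
outlier can pull the mean far from the typical value.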
4. Trimmed Mean
• Definition: Mean computed after removing a fixed percentage of lowest
and highest values.
• Reduces the effect of outliers, because the most extreme values are dropped before averaging.
Useful when data has outliers — e.g., exam scores or sports judging (diving,
gymnastics).
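The trimming idea can be sketched in a few lines of Python. This is only an illustration: the function name trimmed_mean, the parameter trim_count (how many values to drop from each end), and the judge scores are assumptions made for this example, not part of the notes.

```python
from statistics import mean


def trimmed_mean(values, trim_count):
    """Mean after dropping `trim_count` values from each end of the sorted data."""
    if trim_count < 0 or 2 * trim_count >= len(values):
        raise ValueError("trim_count must leave at least one value")
    ordered = sorted(values)
    kept = ordered[trim_count:len(ordered) - trim_count]
    return mean(kept)


# Hypothetical judge scores: one unusually low and one unusually high score.
scores = [2.0, 7.5, 8.0, 8.0, 8.5, 10.0]

plain = mean(scores)                          # uses all six scores
trimmed = trimmed_mean(scores, trim_count=1)  # drops 2.0 and 10.0

print(f"mean = {plain:.2f}, trimmed mean = {trimmed:.2f}")
```

Dropping one score from each end leaves the four middle scores, so the result reflects what most judges agreed on.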
5. Weighted Mean
• Definition: Mean that gives different importance (weights) to data points.
Example:
Value (x) Weight (w)
10 1
20 2
30 3
Weighted mean = (10×1 + 20×2 + 30×3) / (1 + 2 + 3) = 140 / 6 ≈ 23.33
Used when data from groups or sensors have different accuracy or importance.
There are two main reasons to use a weighted mean:
1. Different accuracy: Some values are less reliable (like readings from a
less accurate sensor), so we give them less weight when finding the
average.
2. Unequal representation: Sometimes, the data doesn’t equally represent
all groups. To fix that, we give more weight to the groups that have fewer
samples so the overall result is fair.
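A minimal Python sketch of the weighted mean, using the values 10, 20, 30 with weights 1, 2, 3 from the example above. The function name weighted_mean is an assumption chosen for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [10, 20, 30]
weights = [1, 2, 3]

result = weighted_mean(values, weights)
equal_weights = weighted_mean(values, [1, 1, 1])  # same as the ordinary mean

print(round(result, 2), equal_weights)
```

With equal weights the function reduces to the ordinary mean, which is a quick sanity check on any weighting scheme.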
6. Median
• Definition: The middle value when data is arranged in ascending order.
• If n is odd: middle value.
• If n is even: average of two middle values.
Example:
Data = {1, 2, 4, 5, 6} → Median = 4
Data = {1, 2, 4, 5} → Median = (2 + 4) / 2 = 3
Median is robust to outliers (not affected by extreme values).
Example: Average income in a city — use median instead of mean because one
billionaire can distort the average.
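The billionaire effect is easy to see with Python's built-in statistics module. The income figures below are made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical annual incomes for five residents of a small town.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]

# The same town after one billionaire moves in.
with_billionaire = incomes + [1_000_000_000]

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_billionaire), median(with_billionaire)

print("before:", mean_before, median_before)
print("after: ", mean_after, median_after)
```

The mean jumps by several orders of magnitude, while the median barely moves, which is why the median is the safer summary for income data.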
7. Weighted Median
• Each value has a weight (importance).
• The median is the point where half the total weight lies below and half
above.
Used when sample sizes or group importance differ.
A weighted median works like a normal median, but it also considers the
importance (weight) of each value. We sort the data and find the point where
the total weight on both sides is the same. It helps when some data points are
more important than others and is not affected much by extreme values
(outliers).
Example:
Suppose three sensors give temperature readings:
• Sensor A: 20°C (weight 3 — very reliable)
• Sensor B: 25°C (weight 1 — less reliable)
• Sensor C: 30°C (weight 2 — moderately reliable)
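The total weight is 6, so the balance point is where the cumulative weight reaches 3.
Sorted by value, the cumulative weights are 3 (at 20°C), 4 (at 25°C), and 6 (at 30°C).
Exactly half the weight sits at 20°C and half above it, so any value between 20°C and
25°C balances the weights: the lower weighted median is 20°C, the upper weighted median
is 25°C, and some conventions report the midpoint, 22.5°C.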
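The procedure (sort, accumulate weights, find the halfway point) can be sketched as below. Conventions differ when the cumulative weight lands exactly on half the total, so this sketch returns both the lower and the upper weighted median. The function name and the two-value return are assumptions made for illustration, and all weights are assumed to be positive.

```python
def weighted_median_bounds(values, weights):
    """Return (lower, upper) weighted medians; weights are assumed positive.

    lower: first sorted value whose cumulative weight reaches half the total.
    upper: first sorted value whose cumulative weight exceeds half the total.
    The two differ only when the cumulative weight hits exactly half.
    """
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    lower = upper = None
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if lower is None and cumulative >= half:
            lower = value
        if cumulative > half:
            upper = value
            break
    return lower, upper


# Sensor readings from the example: A = 20 (w 3), B = 25 (w 1), C = 30 (w 2).
temps = [20, 25, 30]
reliability = [3, 1, 2]

lower, upper = weighted_median_bounds(temps, reliability)
print(lower, upper)
```

For the sensor data the cumulative weight reaches exactly half at 20°C, so the two bounds differ; with equal weights and an odd number of values they coincide with the ordinary median.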
8. Outliers
• Definition: Data points that are very different from the rest.
• Can occur due to:
◦ Data entry or measurement errors.
◦ Genuine rare events (e.g., Bill Gates’ income).
• Impact: Strongly affects mean but not median.
• Outliers should be investigated and handled carefully.
Estimates of Variability
1. Meaning
• Variability (Dispersion) tells how spread out or consistent the data
values are.
• It measures how far values deviate from the center (mean or median).
• High variability → values differ widely.
• Low variability → values are close together.
Location = “where” data is centered
Variability = “how much” data varies around that center
2. Key Terms
Term | Definition | Synonyms / Notes
Deviation | Difference between a value and the mean | Error, Residual
Variance | Average of squared deviations from the mean | Mean-squared error
Standard Deviation (SD) | Square root of variance | Spread measure
Mean Absolute Deviation (MAD₁) | Average of absolute deviations | L₁-norm, Manhattan norm
Median Absolute Deviation (MADₘ) | Median of absolute deviations from the median | Robust to outliers
Range | Difference between largest and smallest value | Simple but sensitive
Interquartile Range (IQR) | Difference between 75th and 25th percentiles | Spread of middle 50%
Order Statistics | Values after sorting data | Ranks, Quantiles
3. Measuring Variability – Step-by-Step
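a) Mean Absolute Deviation (MAD₁)
\[ \text{MAD}_1 = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n} \]
(here \bar{x} is the mean of the data)
Simple to compute
Not commonly used in theory (absolute values are hard to manipulate
mathematically)
b) Variance (s²)
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Measures how far data points deviate from the mean
Sensitive to outliers (since deviations are squared)
c) Standard Deviation (s)
\[ s = \sqrt{s^2} \]
Easier to interpret (same unit as original data)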
Why Divide by (n–1)?
• Using n–1 gives an unbiased estimate of population variance.
• “n–1” is called degrees of freedom (because one constraint — the mean
— is already used).
• For large n, difference between n and n–1 is negligible.
d) Median Absolute Deviation (MADₘ) — Robust Measure
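\[ \text{MAD}_m = \text{median}\left(|x_1 - m|, |x_2 - m|, \ldots, |x_n - m|\right) \]
where m is the median of the data.
Resistant to outliers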
Good for skewed or non-normal data
Sometimes multiplied by 1.4826 to make it comparable to SD for normal
distributions.
e) Range
Range = Max - Min
Example:
Data = {2, 6, 9, 15}
Range = 15 – 2 = 13
Very simple
Highly sensitive to outliers (only uses two values)
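The measures in parts (a) to (e) can be computed side by side with Python's standard library, here on the data {2, 6, 9, 15} from the range example. The variable names are illustrative; statistics.variance and statistics.stdev use the n–1 divisor discussed above.

```python
from statistics import mean, median, stdev, variance

data = [2, 6, 9, 15]

center = mean(data)   # mean of the data
middle = median(data)  # median of the data

# a) Mean absolute deviation: average distance from the mean.
mean_abs_dev = mean([abs(x - center) for x in data])

# b) Sample variance and c) sample standard deviation (both divide by n - 1).
sample_variance = variance(data)
sample_sd = stdev(data)

# d) Median absolute deviation: median distance from the median.
median_abs_dev = median([abs(x - middle) for x in data])

# e) Range: largest minus smallest value.
data_range = max(data) - min(data)

print(mean_abs_dev, sample_variance, round(sample_sd, 3), median_abs_dev, data_range)
```

Comparing the numbers shows the pattern described above: the squared-deviation measures react most strongly to the largest value, while the median absolute deviation stays smallest.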
f) Percentiles and Interquartile Range (IQR)
Percentile:
Value below which a certain % of data falls.
• 25th percentile = Q1
• 75th percentile = Q3
• 50th percentile = Median
IQR Formula:
IQR = Q3 - Q1
Example:
Data (sorted) = {1, 2, 3, 3, 5, 6, 7, 9}
• Q1 = 2.5
• Q3 = 6.5
IQR = 6.5 - 2.5 = 4
Measures spread of the middle 50% of data
Largely unaffected by outliers, because it ignores the lowest 25% and highest 25% of values
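The quartiles in the example above follow the "median of each half" rule. A small sketch of that rule is given below; the function name quartiles_by_halves is an assumption. Library routines such as statistics.quantiles use interpolation rules instead, so they can return slightly different quartiles for the same data.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    For an odd number of values the middle value is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]
    return median(lower_half), median(ordered), median(upper_half)


data = [1, 2, 3, 3, 5, 6, 7, 9]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```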
Exploring the Data Distribution
1. Meaning
• After finding measures of location and variability, we explore the overall
shape and pattern of the data.
• Helps to identify:
◦ Symmetry or skewness
◦ Outliers
◦ Tails (extreme values)
◦ Clusters or gaps
Visualization = most effective way to understand data distribution.
So far, we’ve looked at ways to describe data using a single number — like the
mean, median, or standard deviation — to show the center or spread of the data.
But it’s also important to look at how the data values are spread out overall,
that is, the shape or pattern of the entire distribution, not just one summary
number.
2. Key Terms
Term | Definition / Purpose
Boxplot | Visualizes distribution using quartiles (Tukey’s method)
Frequency Table | Groups data into intervals (bins) and counts occurrences
Histogram | Graph of frequency table; shows data spread visually
Density Plot | Smooth curve showing data distribution (probability density)
3. Percentiles and Boxplots
Percentiles
• A percentile indicates the value below which a given percentage of data lies.
◦ 25th percentile (Q1): 25% of data below
◦ 50th percentile: median
◦ 75th percentile (Q3): 75% of data below
For example, in the U.S. state murder-rate data, the median is 4 murders per 100,000 people,
although there is quite a bit of variability: the 5th percentile is only 1.6 and the 95th
percentile is 6.51.
Boxplot
Boxplots, introduced by Tukey [Tukey-1977], are based on percentiles and
give a quick way to visualize the distribution of data. Figure 1-2 shows a
boxplot of the population by state produced by R:
boxplot(state[['Population']]/1000000, ylab='Population (millions)')
pandas provides a number of basic exploratory plots for data frames; one of
them is the boxplot:
ax = (state['Population']/1_000_000).plot.box()
ax.set_ylabel('Population (millions)')
A boxplot helps us quickly understand how the data is spread.
The middle line inside the box shows the median (middle value).
In this example, the median state population is about 5 million.
The box shows where the middle half of the states fall:
between 2 million and 7 million people.
The top of the box is the 75th percentile (upper-middle value).
The bottom of the box is the 25th percentile (lower-middle value).
The whiskers (the dotted or straight lines extending from the box) show the range where
most of the states fall.
Any points outside the whiskers are outliers — states with very high or very low
populations compared to others.
4. Frequency Table
• Divides the data range into equal-width bins and counts observations per bin.
• Shows how data is distributed across intervals.
Example (Population by State):
In R
Simple Explanation of the Histogram Bins
Wyoming has the smallest population (about 5.6 lakh)
California has the largest population (about 3.7 crore)
So the range of populations is:
37,253,956 – 563,626 = 36,690,330
To create a histogram, we divide this big range into equal parts called bins.
If we choose 10 bins, each bin will cover about 3.67 million people (36,690,330 ÷ 10).
The first bin covers populations from 563,626 to 4,232,658.
The top bin covers 33,584,923 to 37,253,956, and it contains only California, because
no other state has such a high population.
The bins just below California are empty (no states fall in that population range).
These empty bins are important because they tell us there are no states in that range.
Choosing bin size matters:
◦ If bins are too big, you hide important patterns.
◦ If bins are too small, the graph becomes confusing and noisy.
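The binning steps described above (find the range, split it into equal-width bins, count the values in each bin) can be sketched in plain Python. The function name frequency_table and the small data set are assumptions made for illustration; they are not the state population data.

```python
def frequency_table(values, bin_count):
    """Split the data range into equal-width bins and count values per bin."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; bins would have zero width")
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for value in values:
        # The maximum value would fall just past the last bin, so clamp it in.
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bin_count + 1)]
    return edges, counts


# Hypothetical data with one value far above the rest.
data = [1, 2, 2, 3, 5, 8, 9, 10, 21]
edges, counts = frequency_table(data, bin_count=4)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

As with California in the population example, the largest value sits alone in the top bin and the bin just below it is empty.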
5. Histogram
• A bar-like plot that represents frequencies of bins.
• X-axis: variable values (divided into bins)
• Y-axis: frequency (count or proportion)
Features:
• Bars are contiguous (touch each other)
• Empty bins show gaps (no data there)
• Number of bins affects clarity
◦ Too few → oversimplified
◦ Too many → noisy and hard to read
Helps identify skewness and outliers.
6. Density Plot
• A smooth version of a histogram showing continuous data distribution.
• Uses Kernel Density Estimation (KDE) to draw the curve.
• Area under the curve = 1 (represents 100% of the data).
Example (Murder Rate Density Plot):
In R
hist(state[['Murder.Rate']], freq=FALSE)
lines(density(state[['Murder.Rate']]), lwd=3, col='blue')
In Python
ax = state['Murder.Rate'].plot.hist(density=True, xlim=[0,12], bins=range(1,12))
state['Murder.Rate'].plot.density(ax=ax)
ax.set_xlabel('Murder Rate (per 100,000)')
Shows peaks, tails, and skewness more smoothly than histogram.
Controlled by bandwidth (bw) — small bw = more detail, large bw = smoother.
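To see what the bandwidth does, here is a rough sketch of a Gaussian kernel density estimate written from scratch. It is a simplified illustration, not how R or pandas implement their density functions; the function names, the sample values, and the two bandwidths are all assumptions.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a density function: the average of Gaussian bumps centred on each point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density


def area_under(density, start, stop, steps=4000):
    """Approximate the area under the curve with the trapezoid rule."""
    step = (stop - start) / steps
    total = 0.5 * (density(start) + density(stop))
    total += sum(density(start + i * step) for i in range(1, steps))
    return total * step


# Hypothetical rates; not the real state data.
rates = [1.0, 2.0, 2.5, 4.0, 7.0]

narrow = gaussian_kde(rates, bandwidth=0.3)  # small bw: more detail
wide = gaussian_kde(rates, bandwidth=1.5)    # large bw: smoother curve

area_narrow = area_under(narrow, -10, 20)
area_wide = area_under(wide, -10, 20)

print(round(area_narrow, 3), round(area_wide, 3))
print(narrow(5.5), wide(5.5))  # density in the gap between 4.0 and 7.0
```

Both curves enclose an area of about 1, but the narrow bandwidth drops almost to zero in the gap between data points, while the wide bandwidth smooths the gap over.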
7. Comparison of Graphs
Graph Type | Data Type | Shows | Best For
Boxplot | Numeric | Median, Quartiles, Outliers | Quick summary comparison
Histogram | Numeric | Frequency of bins | Shape of distribution
Density Plot | Numeric | Smooth distribution curve | Continuous data
Frequency Table | Numeric / Categorical | Tabular count | Precise frequency info
Exploring Binary and Categorical Data
1. What is Categorical Data?
Categorical data means data that represents groups or labels, not numbers.
Examples:
Gender (Male/Female)
Type of delay (Weather, Security, ATC, Inbound)
Blood group (A, B, AB, O)
Binary data is a special type of categorical data with only two values: Yes/No, 0/1, True/False.
2. How to Summarize Categorical or Binary Data?
✔ Just use proportions or percentages
Example:
At Dallas/Fort Worth Airport (DFW), delays since 2010 are caused by:
Cause % of Delays
Carrier 23.02%
ATC 30.40%
Weather 4.03%
Security 0.12%
Inbound aircraft 42.43%
This tells us the largest cause of delays is Inbound aircraft (42.43%).
3. Bar Charts (Best for Categorical Data)
✔Why bar charts?
Show how many times each category occurs
Easy to compare categories
X-axis = category names
Y-axis = count or percentage
Bars do NOT touch each other
✔R code
barplot(as.matrix(dfw) / 6, cex.axis=0.8, cex.names=0.7,
xlab='Cause of delay', ylab='Count')
✔Python Pandas code
ax = dfw.transpose().plot.bar(figsize=(4, 4), legend=False)
ax.set_xlabel('Cause of delay')
ax.set_ylabel('Count')
4. Bar Chart vs Histogram
Feature Bar Chart Histogram
X-axis shows Categories Numeric ranges (bins)
Bars Separated Touch each other
Used for Categorical data Numerical data
5. Pie Charts
Pie charts show proportions in a circle.
But statisticians avoid them because:
Hard to compare slice sizes
Not very accurate visually
Bar charts are a better choice.
6. Mode (Most Frequent Category)
✔ Definition
The mode is the category that appears most often.
Examples:
At DFW Airport, the mode of delay causes = Inbound (highest percentage).
In the U.S., the mode for religion might be Christian.
Mode is mainly used for categorical data, not numeric data.
7. Expected Value (Weighted Average for Categorical Choices)
Sometimes categorical choices have numeric values attached.
Example:
A cloud service has:
5% customers choosing ₹300 plan
15% choosing ₹50 plan
80% choosing nothing (₹0)
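The expected value for this example is each plan's price multiplied by the share of customers choosing it, added up. A short sketch, with shares written as whole percentages to keep the arithmetic exact (the variable names are illustrative):

```python
# (price in rupees, percent of customers choosing that plan)
plans = [(300, 5), (50, 15), (0, 80)]

total_percent = sum(percent for _, percent in plans)
if total_percent != 100:
    raise ValueError("customer shares must add up to 100%")

# Expected value = sum of (price x probability) over all plans.
expected_value = sum(price * percent for price, percent in plans) / 100

print(f"Expected revenue per customer: ₹{expected_value}")
```

Here 0.05 × 300 + 0.15 × 50 + 0.80 × 0 = 15 + 7.5 + 0 = ₹22.5 per customer.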
8. Why Expected Value is Important?
Expected value is used in:
Business decision making
Profit estimation
Risk analysis
Capital budgeting
Predicting future gains or losses
Examples:
Expected five-year profit of a new startup
Expected cost savings of a new hospital software
Expected returns from an investment
Probability
✔ What is Probability?
Probability tells us how likely something is to happen.
You see probability in everyday life:
Weather forecast → “40% chance of rain”
Sports → “Team has 70% chance of winning”
✔ Probability and Odds
Sometimes probability is expressed as odds, especially in sports and games.
Example:
Odds of winning = 2 to 1
So, probability of winning = 2/3 = 66.7%
✔ Formal meaning
Formally, probability means:
If you repeat an event many, many times, the probability is the proportion of times it would
happen.
Example:
If you flip a coin thousands of times:
Heads will come about 50%
Tails will come about 50%
Even though each individual flip is uncertain, over many flips, the proportion settles near 0.5.
This idea is enough for most real-world use.
Definition:
Probability = Number of times an event happens ÷ Number of times you try the experiment (if
repeated many times).
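Both ideas, converting odds to probability and probability as a long-run proportion, can be tried out with a small simulation. The function name, the seed, and the number of flips are arbitrary choices for this sketch.

```python
import random


def odds_to_probability(in_favour, against):
    """Odds of `in_favour` to `against` correspond to in_favour / (in_favour + against)."""
    return in_favour / (in_favour + against)


p_win = odds_to_probability(2, 1)  # odds of 2 to 1

# Simulate many fair coin flips with a fixed seed so the run is repeatable.
rng = random.Random(42)
flips = 100_000
heads = sum(rng.random() < 0.5 for _ in range(flips))
proportion_heads = heads / flips

print(f"P(win) = {p_win:.3f}")
print(f"Proportion of heads in {flips} flips = {proportion_heads:.4f}")
```

Changing flips to a small number such as 10 shows how unstable the proportion is for short runs, which is why the definition speaks of repeating the experiment many times.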
Correlation
1. What is Correlation?
Correlation tells us how strongly two variables move together.
Positive correlation:
When X increases, Y also increases.
Example → Hours studied ↑, marks ↑
Negative correlation:
When X increases, Y decreases.
Example → Speed ↑, travel time ↓
Zero correlation:
No relationship.
2. Example of Perfect Positive Correlation
v1 = {1, 2, 3}
v2 = {4, 5, 6}
Both increase together → perfect correlation.
3. Why Simple “Sum of Products” is Not Useful
Multiply each pair:
1×4 + 2×5 + 3×6 = 32
If you shuffle values in v2, this sum will never be more than 32.
But this number alone is not meaningful.
We need a measure that:
✔ Is standardized
✔ Lies between –1 and +1
✔ Works for all scales
So we use Pearson’s correlation coefficient.
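4. Pearson’s Correlation Coefficient
To compute Pearson’s correlation coefficient, we multiply deviations from the mean for
variable 1 times those for variable 2, and divide by the product of the standard
deviations:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} \]
where s_x and s_y are the sample standard deviations of the two variables.
Result is always between: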
+1 → perfect positive
0 → no linear relationship
–1 → perfect negative
We divide by n–1 (not n) because we are using sample standard deviation, which adjusts for degrees
of freedom.
5. Correlation is Only for Linear Relationships
If the relationship is curved or non-linear, correlation does not work well.
Example:
Tax rate vs revenue:
At first, tax increases → revenue increases
After a point, tax increases → people avoid taxes → revenue decreases
This is not a straight-line relationship → correlation is misleading.
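A from-scratch sketch of Pearson's coefficient shows both points: the vectors v1 and v2 from the earlier example give a perfect correlation, while a curve that rises and then falls gives a correlation of zero even though the two variables are clearly related. The function name pearson_r and the made-up tax and revenue numbers are assumptions for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sum of products of deviations, divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)


# Perfect positive linear relationship.
r_linear = pearson_r([1, 2, 3], [4, 5, 6])

# Hypothetical "tax rate vs revenue" shape: revenue rises, peaks, then falls.
tax_rate = [0, 1, 2, 3, 4, 5, 6]
revenue = [t * (6 - t) for t in tax_rate]
r_curved = pearson_r(tax_rate, revenue)

print(r_linear, r_curved)
```

The second result is zero because the rising half and the falling half cancel out, which is exactly why a scatterplot should be checked before trusting the coefficient.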
6. Correlation Matrix
A correlation matrix shows the correlation between each pair of variables.
Example:
Stocks A, B, C
You find correlation between:
A and A
A and B
A and C
B and B
B and C
C and C
The diagonal always has 1 (correlation of a variable with itself).
You will also see repeated values above and below the diagonal → matrix is symmetrical.
SIMPLE Example of Correlation Matrix
Suppose you have data
Day Stock A Stock B Stock C
1 10 20 30
2 12 22 28
3 14 24 27
Step 1: Calculate correlation between every pair:
Corr(A, B)
Corr(A, C)
Corr(B, C)
Use Pearson formula (or software).
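Final matrix (rounded to two decimals):
A B C
A 1.00 1.00 –0.98
B 1.00 1.00 –0.98
C –0.98 –0.98 1.00
Here A and B rise in lockstep (correlation exactly 1), while C falls as A and B rise
(correlation close to –1).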
How to Compute a Correlation Matrix in Python
import pandas as pd
data = {
'A': [10, 12, 14],
'B': [20, 22, 24],
'C': [30, 28, 27]
}
df = pd.DataFrame(data)
corr_matrix = df.corr()
print(corr_matrix)
How to Compute a Correlation Matrix in R
data <- data.frame(
A = c(10, 12, 14),
B = c(20, 22, 24),
C = c(30, 28, 27)
)
cor_matrix <- cor(data)
print(cor_matrix)
Correlation and Outliers
✔Correlation is sensitive to outliers
Just like mean and standard deviation, the correlation coefficient can be badly affected by outliers.
Example:
If most values follow a pattern, but one value is extremely large or extremely small, that single point
can pull the correlation up or down and give a misleading result.
So:
One extreme value → correlation changes a lot
This means correlation is not robust
Scatterplots
1. What is a Scatterplot?
A scatterplot is a graph that shows the relationship between two numerical variables.
X-axis → one variable
Y-axis → another variable
Each dot on the plot represents one record/data point
Example:
Comparing daily returns of two stocks (ATT and Verizon).
2. Why do we use Scatterplots?
Scatterplots help us visually see:
If two variables increase together (positive correlation)
If one increases while the other decreases (negative correlation)
If there is no pattern (no correlation)
If the relationship is linear or non-linear
Example of Scatterplot
Example: Relationship Between Study Hours and Exam Marks
Suppose we collect data from 8 students about how many hours they studied and the marks they
scored:
Student Study Hours (X) Marks (Y)
1 2 50
2 3 55
3 4 60
4 5 65
5 6 70
6 7 72
7 8 78
8 9 85
If we plot Study Hours on the X-axis and Marks on the Y-axis, each student becomes a point on the
graph.
Interpretation:
As study hours increase, marks also increase.
Most points fall along an upward-sloping pattern.
This shows a positive correlation between study hours and marks.
There are no cases where a student studied more but scored much lower, so the relationship is
strong.
Conclusion:
A scatterplot of study hours vs. marks indicates a strong positive linear relationship. Students who
study more tend to score higher.
Same Example in R
hours <- c(2,3,4,5,6,7,8,9)
marks <- c(50,55,60,65,70,72,78,85)
plot(hours, marks,
xlab="Study Hours",
ylab="Marks",
main="Scatterplot of Study Hours vs Marks")
Same Example in Python
import matplotlib.pyplot as plt
hours = [2,3,4,5,6,7,8,9]
marks = [50,55,60,65,70,72,78,85]
plt.scatter(hours, marks)
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.title("Scatterplot of Study Hours vs Marks")
plt.show()
How to Improve Dense Scatterplots
Later advanced techniques can help:
Transparency (make points slightly see-through)
Hexagonal binning (group points into hexagons)
Density plots (show concentration of points)
These techniques help reveal hidden patterns.
Exploring Two or More Variables
Exploratory Data Analysis (EDA) becomes more meaningful when we study the relationship between
two or more variables. Depending on whether the variables are numeric or categorical, different
techniques are used. When both variables are numeric, visual tools such as scatterplots, hexagonal
binning, and contour plots help us identify patterns. When both variables are categorical, contingency
tables summarize their relationship. When one variable is numeric and the other is categorical,
boxplots and violin plots help us compare distributions across groups. For more than two variables,
faceting allows us to create multiple small plots to compare patterns across subgroups.
1. Hexagonal Binning (Numeric vs Numeric)
When a dataset contains thousands or lakhs of points, a normal scatterplot becomes unreadable.
Hexagonal binning solves this by dividing the plot area into hexagons and counting how many points
fall into each hexagon. Darker hexagons indicate higher data density.
✔ Simple Example
Suppose we want to explore the relationship between
Finished square feet of homes, and
Tax-assessed value
In a large housing dataset (over 4 lakh observations), a normal scatterplot becomes a solid black cloud.
Hexagonal binning clearly shows:
A strong positive relationship between square feet and value
Additional “bands” representing houses valued higher than normal (premium locations)
✔ R Code
ggplot(kc_tax0, aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex(color='white') +
theme_bw() +
scale_fill_gradient(low='white', high='black') +
labs(x='Finished Square Feet', y='Tax-Assessed Value')
✔ Python Code
ax = kc_tax0.plot.hexbin(
x='SqFtTotLiving', y='TaxAssessedValue',
gridsize=30, figsize=(5,4)
)
ax.set_xlabel("Finished Square Feet")
ax.set_ylabel("Tax-Assessed Value")
2. Contour Plots (Density Plots)
Contour plots show the density of two numeric variables. They act like a topographical map, where
closer contour lines indicate more data points.
✔ Example
Using the same housing dataset, a contour plot shows:
A main cluster where most homes fall
A secondary cluster representing high-value areas
Smooth curves showing density instead of individual points
✔ R Code
ggplot(kc_tax0, aes(SqFtTotLiving, TaxAssessedValue)) +
theme_bw() +
geom_point(alpha=0.1) +
geom_density2d(color='white') +
labs(x='Finished Square Feet', y='Tax-Assessed Value')
✔ Python Code
import seaborn as sns
sns.kdeplot(data=kc_tax0, x='SqFtTotLiving', y='TaxAssessedValue')  # 2-D density contours
3. Contingency Tables (Categorical vs Categorical)
A contingency table shows the counts of two categorical variables.
✔ Example
In the Lending Club dataset, each loan has:
A grade (A to G)
A status (Fully Paid, Current, Late, Charged Off)
A contingency table shows:
High-grade loans (A, B) → fewer defaults
Low-grade loans (F, G) → many defaults
This helps identify risk levels.
✔ R Code
library(descr)
CrossTable(lc_loans$grade, lc_loans$status,
prop.c=FALSE, prop.chisq=FALSE, prop.t=FALSE)
✔ Python Code
crosstab = lc_loans.pivot_table(
index='grade', columns='status',
aggfunc=lambda x: len(x), margins=True
)
4. Boxplots (Numeric vs Categorical)
A boxplot compares a numeric variable across categories. It shows:
Median
Quartiles
Overall spread
✔ Example
Consider the percentage of airline delays for different airlines. A boxplot shows:
Alaska Airlines has the lowest delay percentage
American Airlines has the highest
Variability differs between airlines
✔ R Code
boxplot(pct_carrier_delay ~ airline, data=airline_stats,
ylim=c(0, 50))
✔ Python Code
ax = airline_stats.boxplot(by='airline', column='pct_carrier_delay')
5. Violin Plots
Boxplots show summary statistics, but violin plots show the full density distribution. A violin plot is
useful when the distribution is skewed or multi-peaked.
✔ Example
In the airline delay dataset:
The violin plot shows that Alaska has a high concentration of days with very low delays
This density pattern is not clear in the boxplot
✔ R Code
ggplot(airline_stats, aes(airline, pct_carrier_delay)) +
geom_violin() + ylim(0, 50)
✔ Python Code
import seaborn as sns
sns.violinplot(data=airline_stats, x='airline', y='pct_carrier_delay')
6. Faceting (Comparing Multiple Subgroups)
Faceting creates multiple small plots based on a categorical variable.
✔ Example
In the King County housing dataset, we can compare square footage vs. tax value for different zip
codes. Faceted hexbin plots show:
Some zip codes have much higher home values
Other zip codes have lower values even for similar-sized houses
This reveals geographic pricing patterns.
✔ R Code
ggplot(subset(kc_tax0, ZipCode %in% c(98188,98105,98108,98126)),
aes(x=SqFtTotLiving, y=TaxAssessedValue)) +
stat_binhex() +
facet_wrap('ZipCode')
✔ Python Code
g = sns.FacetGrid(kc_tax_zip, col='ZipCode', col_wrap=2)
g.map(plt.hexbin, 'SqFtTotLiving', 'TaxAssessedValue')
Examples for Exploring Binary and Categorical Data
Example 1: Binary Variable — Yes/No Survey
Suppose a survey of 200 students asks whether they use public transport (Yes or No).
120 say “Yes”, 80 say “No”.
Here, the binary variable is “Public Transport Use” (Yes / No).
Proportion using public transport = 120/200 = 0.60 → 60%
Proportion not using = 40%
This tells us at a glance: 60% of students use public transport.
This is an example of summarizing a binary variable using proportion/percentage.
Example 2: Categorical Variable with Multiple Categories + Bar Chart
Suppose you record the favourite fruit of 150 people. The options are Apple, Banana, Mango, Orange,
Others. The results:
Fruit Count
Apple 50
Banana 30
Mango 45
Orange 20
Others 5
You can summarise this categorical data by converting counts to percentages, e.g.:
Apple: 33.3%
Banana: 20%
Mango: 30%
Orange: 13.3%
Others: 3.3%
To visualise, a bar chart works well: x-axis = Fruit names; y-axis = percentage or count.
This example shows how you explore a categorical variable with more than two categories.
Mode: Here, Apple is the mode (most frequent fruit).
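The same summary (counts to percentages, then the mode) can be produced with collections.Counter from Python's standard library, using the fruit counts from the table above. The variable names are illustrative.

```python
from collections import Counter

# Favourite-fruit counts from the survey table.
counts = Counter({"Apple": 50, "Banana": 30, "Mango": 45, "Orange": 20, "Others": 5})

total = sum(counts.values())

# Convert each count to a percentage of all respondents.
percentages = {fruit: 100 * count / total for fruit, count in counts.items()}

# The mode is the category with the highest count.
mode, mode_count = counts.most_common(1)[0]

for fruit, pct in percentages.items():
    print(f"{fruit:7s} {pct:5.1f}%")
print("Mode:", mode)
```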
Example 3: Categorical Data Representing Numeric Values → Expected Value
Suppose a software company offers subscription plans:
Premium Plan — ₹500/month — 10% of customers choose this
Standard Plan — ₹200/month — 30% choose this
No Plan — ₹0 — 60% choose this
We want to find the expected monthly revenue per customer.
Compute:
Expected revenue = (0.10 × ₹500) + (0.30 × ₹200) + (0.60 × ₹0) = ₹50 + ₹60 + ₹0 = ₹110
So expected revenue per customer = ₹110/month.
This shows how categorical choices mapped to numeric values can be summarised via expected value.
Examples for Exploring Two or More Variables
Here we deal with relationships between variables — numeric–numeric, categorical–categorical,
numeric–categorical, or multivariate.
Example 1: Numeric vs Numeric → Scatterplot / Hexagonal Binning / Contour
Scenario: You have data of 500 houses with their “Area (in sq ft)” and “Market Price (₹)”.
If you plot a scatterplot (Area on X-axis, Price on Y-axis), you might see a cloud of points. It may
show a general upward trend (bigger area → higher price), but many points may overlap.
If the dataset were 50,000 houses, scatterplot would be messy. Instead: use hexagonal binning. The
plot will shade hexagons with many points darker. You may observe a clear positive relationship: as
area increases, price increases. You might also see a “dense band” of standard-priced houses and a
second band of premium-priced houses (e.g. sea-facing or luxury homes) — which regular scatterplot
might hide.
Alternatively, a contour plot will show density contours: the densest region (mode) where most
houses lie, lighter regions for expensive or small homes. This makes patterns visible even with many
data points.
Example 2: Categorical vs Categorical → Contingency Table
Scenario: A university wants to study relationship between stream of study (Science, Commerce,
Arts) and whether student is placed after internship (Placed, Not Placed). They survey 300 students:
Stream Placed Not Placed Total
Science 80 20 100
Commerce 60 40 100
Arts 30 70 100
From this contingency table, one can observe:
Placement rate for Science: 80%
For Commerce: 60%
For Arts: 30%
This shows that students from Science stream have higher placement rates than those from Arts. This
is how categorical–categorical relationships are explored.
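A contingency table is just a count of each (row category, column category) pair. The sketch below rebuilds individual student records from the counts in the table above, purely so there is something to count, and then derives the placement rates. The variable names are illustrative.

```python
from collections import Counter

# Counts taken from the table above.
table_counts = {
    ("Science", "Placed"): 80, ("Science", "Not Placed"): 20,
    ("Commerce", "Placed"): 60, ("Commerce", "Not Placed"): 40,
    ("Arts", "Placed"): 30, ("Arts", "Not Placed"): 70,
}

# Expand to one (stream, outcome) record per student, as raw survey data would look.
records = [pair for pair, n in table_counts.items() for _ in range(n)]

# The contingency table: how many records fall into each combination.
crosstab = Counter(records)

streams = ["Science", "Commerce", "Arts"]
placement_rate = {}
for stream in streams:
    placed = crosstab[(stream, "Placed")]
    not_placed = crosstab[(stream, "Not Placed")]
    placement_rate[stream] = placed / (placed + not_placed)

for stream in streams:
    print(f"{stream:9s} {placement_rate[stream]:.0%}")
```

Dividing by the row total, rather than the grand total, is what makes the rates comparable across streams even when the streams have different numbers of students.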
Example 3: Numeric vs Categorical → Boxplot / Violin Plot
Scenario: A company records the working hours per day of employees and their department: HR,
Engineering, Sales. They have data for 300 employees.
To compare distribution of working hours across departments:
Draw a boxplot: x-axis = Department, y-axis = Hours worked
◦ You may see that Engineering has a higher median and a wider spread of working hours,
while Sales has a lower median but some outliers (people working long hours).
To see the full distribution (density), draw a violin plot.
◦ The violin plot might reveal that HR hours are tightly clustered around 8 hours,
Engineering shows wide variation (some work 6–10 hours, some 12–14 hours), and
Sales shows two peaks (morning shift and evening shift).
This helps compare groups — something mean/SD cannot show clearly.
Example 4: Multivariate / Faceting — Comparing Subgroups
Scenario: A real-estate dataset contains house size (sq ft), price (₹), and city zone (Zone A, Zone B,
Zone C).
If you plot size vs price for all zones together, you may see a general positive relationship — but zone-
wise differences may be hidden.
By using faceting, you create a separate plot for each zone:
Zone A: houses show very high prices for a given size (premium area)
Zone B: moderate prices
Zone C: lower prices
Faceting reveals that location influences price strongly, besides size. Thus, multivariate analysis helps
uncover deeper patterns.