Descriptive Statistics & Visualization Before Analysis in R
Understand your data before modelling: measures of center, spread, shape, and the essential plots every analyst needs
Section Topic
1 What is Descriptive Statistics?
2 Measures of Central Tendency
3 Measures of Spread / Variability
4 Shape: Skewness & Kurtosis
5 Five-Number Summary & summary()
6 Why Visualize Before Analysis?
7 Essential Plots — Histogram
8 Essential Plots — Boxplot
9 Essential Plots — Scatter Plot
10 Essential Plots — Bar Chart & Pie Chart
11 ggplot2 — Grammar of Graphics
12 Full Workflow Example
R — Descriptive Statistics & Visualization Before Analysis Page 1
1. What is Descriptive Statistics?
Descriptive statistics summarise and describe the main features of a dataset without making inferences beyond
the data itself. They answer the question: what does my data look like?
Category              What it measures               R functions
Central Tendency      Where the data is centered     mean(), median(), table() for the mode
Spread / Variability  How spread out the data is     var(), sd(), range(), IQR()
Shape                 Symmetry and tail heaviness    skewness(), kurtosis() (e1071)
Position              Relative standing of values    quantile(), rank(), scale()
Frequency             Count of categories            table(), prop.table()
■ Rule: Always run descriptive statistics AND visualize your data BEFORE any modelling or hypothesis testing.
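The position functions listed in the table can be tried on a small vector. The sketch below uses made-up values; the names `scores` and `z` are illustrative assumptions, not part of the original material.

```r
scores <- c(55, 70, 70, 85, 100)   # made-up values for illustration

quantile(scores, c(0.25, 0.75))    # values at the 25th and 75th percentiles
rank(scores)                       # 1.0 2.5 2.5 4.0 5.0 (ties share the average rank)
z <- as.numeric(scale(scores))     # z-scores: (value - mean) / sd
round(z, 2)
```

After scaling, the z-scores have mean 0 and standard deviation 1, which makes variables measured in different units comparable.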
2. Measures of Central Tendency
Mean — Arithmetic Average
Sum of all values divided by n. Sensitive to outliers.
x <- c(10, 20, 30, 40, 50)
mean(x) # 30
mean(x, na.rm=TRUE) # ignores NA values
# Weighted mean
weighted.mean(x, w = c(1,2,3,2,1))
Median — Middle Value
The value that splits the sorted data in half. Robust to outliers — preferred for skewed data.
median(x) # 30
# Median is better than mean when data is skewed
# e.g. income data — a few billionaires inflate the mean
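A small worked example makes the outlier sensitivity concrete. The income figures below are hypothetical and chosen only so the arithmetic is easy to follow.

```r
incomes <- c(30, 35, 40, 45, 50)     # hypothetical incomes, in thousands
with_outlier <- c(incomes, 10000)    # add one extreme earner

mean(incomes)          # 40
median(incomes)        # 40
mean(with_outlier)     # 1700
median(with_outlier)   # 42.5
```

One extreme value moves the mean from 40 to 1700, while the median barely changes.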
Mode — Most Frequent Value
R has no built-in mode() for data (mode() returns storage type). Use table() instead.
x <- c(1, 2, 2, 3, 3, 3, 4)
names(which.max(table(x))) # "3"
# Same idea, returned as a number (base R; no extra package is needed)
mode_val <- as.numeric(names(table(x))[which.max(table(x))])
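Note that which.max() returns only the first maximum, so tied modes are silently dropped. A minimal sketch of a helper that returns every tied value follows; the function name `stat_modes` is a hypothetical choice, not a base R or package function.

```r
# Hypothetical helper: returns every value tied for the highest count
stat_modes <- function(v) {
  counts <- table(v)                            # NA values are dropped by default
  top <- names(counts)[counts == max(counts)]   # all values with the top count
  if (is.numeric(v)) as.numeric(top) else top
}

stat_modes(c(1, 2, 2, 3, 3, 3, 4))   # 3
stat_modes(c(1, 1, 2, 2, 3))         # 1 2 (two modes)
stat_modes(c("a", "b", "b"))         # "b"
```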
3. Measures of Spread / Variability
Measure    Formula / Description                     R function                 Notes
Range      Max − Min                                 range(x); diff(range(x))   Very sensitive to outliers
Variance   Avg squared deviation from mean           var(x)                     In squared units
Std Dev    Square root of variance                   sd(x)                      Same unit as data
IQR        Q3 − Q1                                   IQR(x)                     Robust to outliers
MAD        Median Abs Deviation                      mad(x)                     Most robust
CV         sd/mean × 100%                            sd(x)/mean(x)*100          Unit-free comparison
x <- c(4, 8, 15, 16, 23, 42)
range(x) # 4 42
diff(range(x)) # 38
var(x) # 182
sd(x) # 13.49
IQR(x) # 11.5
mad(x) # 11.12
sd(x)/mean(x)*100 # CV = 74.9%
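To see what var(), sd(), and mad() compute, the formulas can be reproduced by hand. This is a sketch for checking understanding; the names prefixed with `manual_` are illustrative.

```r
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)

# Sample variance: squared deviations divided by n - 1 (not n)
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

# mad() scales the raw median absolute deviation by 1.4826 by default
raw_mad    <- median(abs(x - median(x)))
manual_mad <- 1.4826 * raw_mad

all.equal(manual_var, var(x))   # TRUE
all.equal(manual_sd, sd(x))     # TRUE
all.equal(manual_mad, mad(x))   # TRUE
```

The 1.4826 constant makes the MAD comparable to the standard deviation for normally distributed data; pass constant = 1 to mad() for the unscaled value.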
4. Shape: Skewness & Kurtosis
Skewness — Asymmetry of the distribution
Value Interpretation
Skewness < −1 Strongly left-skewed (long left tail)
−1 to −0.5 Moderately left-skewed
−0.5 to +0.5 Approximately symmetric
+0.5 to +1 Moderately right-skewed
Skewness > +1 Strongly right-skewed (long right tail — e.g. income, house prices)
Kurtosis — Tail heaviness
Type Excess Kurtosis Description
Mesokurtic ≈ 0 Normal distribution — baseline
Leptokurtic > 0 Heavy tails, sharp peak — more extreme outliers
Platykurtic < 0 Light tails, flat peak — fewer extreme values
# install.packages("e1071")
library(e1071)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
skewness(x) # positive = right skewed
kurtosis(x) # excess kurtosis (normal = 0)
# Base R alternative
n <- length(x)
sk <- (sum((x - mean(x))^3)/n) / (sd(x)^3)
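Excess kurtosis can be computed in base R in the same style as the skewness formula above. This is a sketch that assumes the same moment-based definition (fourth central moment divided by sd(x)^4, minus 3), which should correspond to the e1071 default; other definitions give slightly different values on small samples.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

m4 <- sum((x - mean(x))^4) / n   # fourth central moment
ku <- m4 / sd(x)^4 - 3           # excess kurtosis (normal distribution = 0)
ku                               # negative here: lighter tails than a normal
```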
5. Five-Number Summary & summary()
The five-number summary gives a compact picture of a distribution's center, spread, and range in just five values.
Statistic What it is R function
Min Smallest value min(x)
Q1 25th percentile (lower quartile) quantile(x, 0.25)
Median 50th percentile (middle value) median(x)
Q3 75th percentile (upper quartile) quantile(x, 0.75)
Max Largest value max(x)
IQR Q3 − Q1 (middle 50% of data) IQR(x)
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)
# Five-number summary
fivenum(x) # Min Q1 Median Q3 Max
quantile(x) # 0% 25% 50% 75% 100%
quantile(x, c(0.1, 0.9)) # 10th and 90th percentile
# Full descriptive summary (most useful function in R)
summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 3.00 7.00 12.00 11.22 14.00 21.00
# For a data frame — summarizes ALL columns at once
summary(mtcars)
# psych package — even more detail
# install.packages("psych")
library(psych)
describe(mtcars)
■ summary() is your first stop for every new dataset. Run it before anything else.
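fivenum() and quantile() do not always agree on the quartiles: fivenum() returns Tukey's hinges, while quantile() interpolates (type = 7 by default). The small example below, on a vector chosen for illustration, shows the difference.

```r
x <- 1:8

fivenum(x)    # 1.0 2.5 4.5 6.5 8.0      (hinges)
quantile(x)   # 1.00 2.75 4.50 6.25 8.00 (default type = 7)
```

The two usually differ only slightly, but it helps to know which one a given function or plot is using.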
6. Why Visualize Before Analysis?
Raw statistics alone can be dangerously misleading. Always plot your data first.
■ Anscombe's Quartet: all four datasets share nearly identical means, variances, correlations, and regression lines, yet they look
completely different when plotted. Only a plot reveals the difference.
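R ships the quartet as the built-in `anscombe` data frame, so the claim is easy to check. The sketch below compares a few summary statistics and then plots the four pairs; the helper variable names are illustrative.

```r
data(anscombe)   # built-in: columns x1..x4 and y1..y4

x_means <- sapply(anscombe[, 1:4], mean)
y_means <- sapply(anscombe[, 5:8], mean)
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

round(x_means, 2)   # all 9
round(y_means, 2)   # all 7.5
round(cors, 2)      # all 0.82

# The plots tell four different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
par(op)
```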
The EDA (Exploratory Data Analysis) workflow:
Step  Action                                   R code
1     Load data and check structure            str(df); head(df); dim(df)
2     Run summary statistics                   summary(df); describe(df)
3     Check missing values                     sum(is.na(df)); colSums(is.na(df))
4     Plot distributions (histogram/density)   hist(x); plot(density(x))
5     Check for outliers (boxplot)             boxplot(x)
6     Explore relationships (scatter/corr)     plot(x,y); cor(df)
7     Then proceed to modelling                lm(), glm(), t.test(), …
7. Histogram
Shows the distribution of a single continuous variable by dividing it into bins and counting frequencies.
# Base R
hist(mtcars$mpg)
hist(mtcars$mpg, breaks=15, col="steelblue", border="white",
main="Distribution of MPG", xlab="Miles per Gallon")
# Density curve overlay
hist(mtcars$mpg, freq=FALSE, col="lightblue")
lines(density(mtcars$mpg), col="darkblue", lwd=2)
# ggplot2
library(ggplot2)
ggplot(mtcars, aes(x=mpg)) +
  geom_histogram(aes(y=after_stat(density)), bins=12, fill="steelblue", color="white") +
  geom_density(color="red", lwd=1) + # bars and curve now share the density scale
  labs(title="MPG Distribution", y="Density") + theme_minimal()
■ Use breaks= in base R or bins= in ggplot2 to control the number of bins (base R treats a single number as a suggestion). Too few bins hide detail; too many add noise.
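When the exact bin edges matter, hist() also accepts a vector of break points. The sketch below contrasts the two forms; the edges chosen here (10 to 35 in steps of 2.5) are an assumption picked to cover the mpg range.

```r
# A single number is only a suggestion; R picks "pretty" break points near it
h_suggested <- hist(mtcars$mpg, breaks = 15, plot = FALSE)
length(h_suggested$counts)   # the number of bins R actually chose

# A vector of break points is used exactly as given
h_exact <- hist(mtcars$mpg, breaks = seq(10, 35, by = 2.5), plot = FALSE)
length(h_exact$counts)       # 10
sum(h_exact$counts)          # 32 (every car falls in some bin)
```

With plot = FALSE, hist() returns the computed bins without drawing, which is handy for checking a binning choice.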
8. Boxplot (Box-and-Whisker Plot)
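Displays the five-number summary visually: the box spans Q1 to Q3, the line marks the median, and the whiskers reach the most extreme points within 1.5 × IQR of the box. Excellent for comparing groups and spotting outliers.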
# Base R boxplot
boxplot(mpg ~ cyl, data=mtcars,
col=c("lightblue","lightgreen","salmon"),
main="MPG by Cylinders", xlab="Cylinders", ylab="MPG")
# ggplot2 boxplot
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
geom_boxplot() +
geom_jitter(width=0.1, alpha=0.4) + # show actual points
labs(title="MPG by Cylinders") + theme_minimal()
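The numbers behind a base R boxplot can be inspected with boxplot.stats(). The sketch below applies it to mpg and recomputes the upper fence by hand; the variable names are illustrative.

```r
bs <- boxplot.stats(mtcars$mpg)

bs$stats   # 10.40 15.35 19.20 22.80 33.90 (whisker, hinge, median, hinge, whisker)
bs$out     # numeric(0): no points beyond the whiskers

# Fences are hinge-based: hinge + 1.5 * (upper hinge - lower hinge)
upper_fence <- bs$stats[4] + 1.5 * (bs$stats[4] - bs$stats[2])
upper_fence   # 33.975, just above the maximum of 33.9
```

Because the box uses hinges rather than quantile() quartiles, a fence computed from quantile() and IQR() can differ slightly and flag a borderline point that the plot does not.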
9. Scatter Plot
Reveals the relationship between two continuous variables. Essential before computing correlation.
# Base R scatter
plot(mtcars$wt, mtcars$mpg,
xlab="Weight", ylab="MPG",
main="Weight vs MPG", pch=19, col="steelblue")
abline(lm(mpg~wt, data=mtcars), col="red", lwd=2)
# Correlation coefficient
cor(mtcars$wt, mtcars$mpg) # −0.868
cor(mtcars, method="pearson") # full correlation matrix
# Pairs plot — all variables at once
pairs(mtcars[,1:5], pch=19, col="steelblue")
# ggplot2
ggplot(mtcars, aes(x=wt, y=mpg, color=factor(cyl))) +
geom_point(size=3, alpha=0.8) +
geom_smooth(method="lm", se=TRUE) +
labs(title="Weight vs MPG by Cylinders") + theme_minimal()
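A constructed example shows why the plot should come before the coefficient: Pearson correlation only measures linear association. The data below are artificial and chosen so that y depends entirely on x.

```r
x <- -5:5
y <- x^2    # y is completely determined by x, but not linearly

cor(x, y)   # 0: no linear association despite a perfect relationship
plot(x, y, pch = 19, main = "Perfect curve, zero correlation")
```

A correlation near zero therefore does not mean "no relationship"; the scatter plot shows the curve immediately.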
10. Bar Chart & Pie Chart
Bar charts show frequencies or counts of categorical variables. Pie charts show proportions.
# Base R — bar chart
counts <- table(mtcars$cyl) # frequency table
barplot(counts,
col=c("steelblue","salmon","lightgreen"),
main="Cars by Cylinders", xlab="Cylinders", ylab="Count")
# Pie chart (use sparingly — bar charts are usually better)
pie(counts, labels=paste(names(counts),"cyl"), col=rainbow(3))
# ggplot2 bar chart
ggplot(mtcars, aes(x=factor(cyl), fill=factor(cyl))) +
geom_bar() +
labs(title="Cars by Cylinders", x="Cylinders") + theme_minimal()
■ Prefer bar charts over pie charts — humans are better at comparing lengths than angles.
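If proportions are the goal, a bar chart of shares is one alternative to a pie chart. The sketch below converts the counts with prop.table(); the title and axis limits are illustrative choices.

```r
counts <- table(mtcars$cyl)     # 11, 7 and 14 cars with 4, 6 and 8 cylinders
props  <- prop.table(counts)    # 11/32, 7/32, 14/32 (sums to 1)
round(100 * props, 1)           # the same shares as percentages

barplot(props, col = "steelblue", ylim = c(0, 0.5),
        main = "Share of Cars by Cylinders",
        xlab = "Cylinders", ylab = "Proportion")
```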
11. ggplot2 — Grammar of Graphics
ggplot2 is a widely used R visualization package built on a layered grammar of graphics. Every plot is assembled from the same
components.
Layer       Purpose                               Examples
Data        The dataset to use                    ggplot(data=mtcars, ...)
Aesthetics  Map variables to visual properties    aes(x=wt, y=mpg, color=cyl)
Geoms       The geometric shape to draw           geom_point(), geom_line(), geom_bar()
Stats       Statistical transformations           stat_smooth(), stat_count()
Scales      Control axes, colors, sizes           scale_color_manual(), scale_x_log10()
Facets      Split into subplots                   facet_wrap(~cyl), facet_grid(am~cyl)
Theme       Non-data appearance                   theme_minimal(), theme_bw()
library(ggplot2)
# Template — every ggplot follows this pattern:
# ggplot(data, aes(x, y)) + geom_*() + labs() + theme_*()
# Histogram
ggplot(mtcars, aes(x=mpg)) + geom_histogram(bins=10, fill="steelblue")
# Boxplot by group
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
geom_boxplot() + theme_minimal() + labs(x="Cylinders")
# Scatter with regression line
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point(aes(color=factor(cyl)), size=3) +
geom_smooth(method="lm", se=TRUE, color="black") +
theme_minimal()
# Faceted plot — one panel per cylinder group
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point() + facet_wrap(~cyl) + theme_minimal()
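Because a ggplot is an ordinary R object, it can be built one layer at a time, which makes the grammar easier to see. The sketch below also uses the two scale functions from the table; the colors and the choice of hp on a log axis are illustrative assumptions.

```r
library(ggplot2)

# Start with data + aesthetics, then add one component at a time
p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl)))
p <- p + geom_point(size = 3)                        # geom layer
p <- p + scale_x_log10()                             # scale: log10 x axis
p <- p + scale_color_manual(values = c("4" = "steelblue",
                                       "6" = "darkgreen",
                                       "8" = "salmon"),
                            name = "Cylinders")      # scale: custom colors
p <- p + theme_bw()                                  # theme
p   # printing the object draws the plot
```

Storing the plot in a variable makes it easy to reuse a base plot and try different themes or facets on top of it.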
12. Full EDA Workflow Example (mtcars dataset)
# ---- STEP 1: Load & inspect ----
data(mtcars)
dim(mtcars) # 32 rows, 11 columns
str(mtcars) # variable types
head(mtcars, 5) # first 5 rows
# ---- STEP 2: Missing values ----
sum(is.na(mtcars)) # 0, no missing values
colSums(is.na(mtcars)) # per-column count
# ---- STEP 3: Summary statistics ----
summary(mtcars)
library(psych)
describe(mtcars) # adds skewness, kurtosis, SE
# ---- STEP 4: Distribution of key variable ----
library(e1071) # provides skewness() and kurtosis()
# freq=FALSE puts the bars on the density scale so the curve overlays correctly
hist(mtcars$mpg, breaks=12, freq=FALSE, col="steelblue", border="white",
     main="MPG Distribution", xlab="Miles per Gallon")
lines(density(mtcars$mpg), col="red", lwd=2)
skewness(mtcars$mpg) # about 0.6: moderate right skew
kurtosis(mtcars$mpg) # negative excess kurtosis: platykurtic
# ---- STEP 5: Outlier detection ----
boxplot(mtcars$mpg, main="MPG Boxplot", col="lightblue")
# IQR method
Q1 <- quantile(mtcars$mpg, 0.25)
Q3 <- quantile(mtcars$mpg, 0.75)
IQR_val <- IQR(mtcars$mpg)
outliers <- mtcars$mpg[mtcars$mpg < Q1 - 1.5*IQR_val |
                       mtcars$mpg > Q3 + 1.5*IQR_val]
# ---- STEP 6: Relationships ----
cor(mtcars[, c("mpg","wt","hp","disp")]) # correlation matrix
pairs(mtcars[,1:5], pch=19,
      col=ifelse(mtcars$cyl==4,"blue",
          ifelse(mtcars$cyl==6,"green","red")))
# ---- STEP 7: Grouped comparisons ----
tapply(mtcars$mpg, mtcars$cyl, summary)
boxplot(mpg ~ cyl, data=mtcars, col=rainbow(3),
        main="MPG by Cylinders")
# ---- STEP 8: Now proceed to modelling ----
model <- lm(mpg ~ wt + hp + cyl, data=mtcars)
summary(model)
■ Always complete Steps 1–7 before Step 8. Skipping EDA leads to wrong models and missed patterns.
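The IQR rule in Step 5 returns only the flagged values. To see which observations they belong to, the same rule can be used as a logical filter on the data frame. This is a sketch; the names `lower`, `upper`, and `is_outlier` are illustrative.

```r
mpg <- mtcars$mpg
lower <- quantile(mpg, 0.25) - 1.5 * IQR(mpg)
upper <- quantile(mpg, 0.75) + 1.5 * IQR(mpg)

is_outlier <- mpg < lower | mpg > upper
mtcars[is_outlier, c("mpg", "cyl", "wt")]   # rows flagged by the rule
```

For mtcars this flags a single car, the Toyota Corolla at 33.9 mpg. A flagged value is a prompt to look closer, not automatically an error to delete.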
Quick Reference — Plot Type Chooser
Data / Goal Best Plot R function
One continuous variable Histogram + density hist() / geom_histogram()
One categorical variable Bar chart barplot() / geom_bar()
Two continuous variables Scatter plot plot() / geom_point()
One continuous, one categorical Boxplot / violin plot boxplot() / geom_boxplot()
Trend over time Line chart plot(type="l") / geom_line()
Proportions of categories Pie / stacked bar pie() / geom_bar(position="fill")
Many variables at once Pairs / heatmap pairs() / corrplot() (corrplot package)
Distribution shape Q-Q plot qqnorm(); qqline()
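The Q-Q plot listed in the last row is not shown elsewhere in this handout, so here is a minimal base R sketch using mpg. Points that follow the reference line suggest an approximately normal distribution; systematic curvature at the ends suggests skew or heavy tails.

```r
qqnorm(mtcars$mpg, main = "Normal Q-Q Plot of MPG", pch = 19)
qqline(mtcars$mpg, col = "red", lwd = 2)   # reference line through the quartiles
```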