
Descriptive Statistics Visualization R

The document provides a comprehensive overview of descriptive statistics and visualization techniques in R, emphasizing the importance of understanding data before analysis. It covers key concepts such as measures of central tendency, variability, and essential plots like histograms, boxplots, and scatter plots. Additionally, it outlines a full exploratory data analysis (EDA) workflow using the mtcars dataset.

Uploaded by

Kajal Singh
Copyright
© All Rights Reserved

Descriptive Statistics & Visualization Before Analysis in R

Understand your data before modelling: measures of center, spread, shape, and the essential plots every analyst needs

Section  Topic

1        What is Descriptive Statistics?
2        Measures of Central Tendency
3        Measures of Spread / Variability
4        Shape: Skewness & Kurtosis
5        Five-Number Summary & summary()
6        Why Visualize Before Analysis?
7        Essential Plots: Histogram
8        Essential Plots: Boxplot
9        Essential Plots: Scatter Plot
10       Essential Plots: Bar Chart & Pie Chart
11       ggplot2: Grammar of Graphics
12       Full Workflow Example

R — Descriptive Statistics & Visualization Before Analysis Page 1


1. What is Descriptive Statistics?

Descriptive statistics summarise and describe the main features of a dataset without making inferences beyond the data itself. They answer the question: What does my data look like?

Category              What it measures              R functions

Central Tendency      Where the data is centered    mean(), median(), table() (for the mode)
Spread / Variability  How spread out the data is    var(), sd(), range(), IQR()
Shape                 Symmetry and tail heaviness   skewness(), kurtosis() (e1071 package)
Position              Relative standing of values   quantile(), rank(), scale()
Frequency             Count of categories           table(), prop.table()

■ Rule: Always run descriptive statistics AND visualize your data BEFORE any modelling or hypothesis testing.

2. Measures of Central Tendency

Mean — Arithmetic Average


Sum of all values divided by n. Sensitive to outliers.

x <- c(10, 20, 30, 40, 50)

mean(x)                 # 30
mean(x, na.rm=TRUE)     # ignores NA values

# Weighted mean
weighted.mean(x, w = c(1,2,3,2,1))   # 30 (the weights are symmetric)

Median — Middle Value


The value that splits the sorted data in half. Robust to outliers — preferred for skewed data.

median(x) # 30

# Median is better than mean when data is skewed

# e.g. income data — a few billionaires inflate the mean
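The effect of a single outlier on the two measures can be seen in a small sketch. The income figures are made up for illustration and are not from any real dataset.

```r
# Hypothetical incomes (in thousands); the last value is an extreme outlier
income <- c(30, 35, 40, 45, 50, 1000)

mean(income)      # 200: pulled far above the typical value
median(income)    # 42.5: barely affected by the outlier

# Without the outlier the two measures agree
mean(income[-6])      # 40
median(income[-6])    # 40
```

One extreme value multiplies the mean by five here, while the median moves by only 2.5.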

Mode — Most Frequent Value


R has no built-in mode() for data (mode() returns storage type). Use table() instead.

x <- c(1, 2, 2, 3, 3, 3, 4)

names(which.max(table(x)))   # "3" (returned as a character string)

# Same result as a number. This is base R only; no package such as e1071 is required.
# which.max() returns only the first maximum, so ties are not reported.
mode_val <- as.numeric(names(table(x))[which.max(table(x))])   # 3
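If tied modes matter, the table() approach can be wrapped in a small helper. This is only a sketch: stat_mode() is a hypothetical name chosen here, not a function from base R or any package.

```r
# stat_mode() is a hypothetical helper name, not a base R function
stat_mode <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  counts <- table(x)
  # Keep every value whose count equals the highest count (handles ties)
  top <- names(counts)[counts == max(counts)]
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

stat_mode(c(1, 2, 2, 3, 3, 3, 4))   # 3
stat_mode(c(1, 1, 2, 2, 3))         # 1 2 (both tied values are returned)
stat_mode(c("a", "b", "b"))         # "b"
```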

3. Measures of Spread / Variability

Measure    Formula / Description             R function                 Notes

Range      Max − Min                         range(x); diff(range(x))   Very sensitive to outliers
Variance   Avg squared deviation from mean   var(x)                     In squared units
Std Dev    Square root of variance           sd(x)                      Same unit as data
IQR        Q3 − Q1                           IQR(x)                     Robust to outliers
MAD        Median Abs Deviation              mad(x)                     Most robust
CV         sd/mean × 100%                    sd(x)/mean(x)*100          Unit-free comparison

x <- c(4, 8, 15, 16, 23, 42)

range(x)             # 4 42
diff(range(x))       # 38
var(x)               # 182
sd(x)                # 13.49
IQR(x)               # 11.5
mad(x)               # 11.12
sd(x)/mean(x)*100    # CV = 74.9%
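To connect the formulas in the table with the built-in functions, the measures can be recomputed by hand. This is a sketch for checking understanding; the variable names (var_manual, mad_raw) are illustrative only.

```r
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)

# Sample variance: squared deviations divided by n - 1 (not n)
var_manual <- sum((x - mean(x))^2) / (n - 1)
sd_manual  <- sqrt(var_manual)

# mad() multiplies the raw median absolute deviation by 1.4826 by default,
# so that it estimates the standard deviation for normally distributed data
mad_raw <- median(abs(x - median(x)))

all.equal(var_manual, var(x))          # TRUE
all.equal(sd_manual, sd(x))            # TRUE
all.equal(mad_raw * 1.4826, mad(x))    # TRUE
mad(x, constant = 1)                   # 7.5, the unscaled version
```

Note that var() and sd() use the n − 1 denominator, so they are sample estimates rather than population values.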

4. Shape: Skewness & Kurtosis

Skewness — Asymmetry of the distribution

Value             Interpretation

Skewness < −1     Strongly left-skewed (long left tail)
−1 to −0.5        Moderately left-skewed
−0.5 to +0.5      Approximately symmetric
+0.5 to +1        Moderately right-skewed
Skewness > +1     Strongly right-skewed (long right tail, e.g. income, house prices)

Kurtosis — Tail heaviness

Type          Excess Kurtosis   Description

Mesokurtic    ≈ 0               Normal distribution (baseline)
Leptokurtic   > 0               Heavy tails, sharp peak; more extreme outliers
Platykurtic   < 0               Light tails, flat peak; fewer extreme values

# install.packages("e1071")
library(e1071)

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
skewness(x)   # positive = right skewed
kurtosis(x)   # excess kurtosis (normal = 0)

# Base R alternative: third central moment divided by sd^3
n <- length(x)
sk <- (sum((x - mean(x))^3)/n) / (sd(x)^3)
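The same base R approach extends to kurtosis. The sketch below assumes the moment-based definition (fourth central moment divided by sd^4, minus 3), which to my understanding is what e1071 uses by default (type = 3); the names m3, m4 and ku are illustrative.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

m3 <- sum((x - mean(x))^3) / n    # third central moment
m4 <- sum((x - mean(x))^4) / n    # fourth central moment

sk <- m3 / sd(x)^3         # skewness, same formula as the base R line above
ku <- m4 / sd(x)^4 - 3     # excess kurtosis (0 for a normal distribution)

round(sk, 2)    # 0.54: moderately right-skewed
round(ku, 2)    # -0.87: lighter tails than a normal distribution
```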

5. Five-Number Summary & summary()

The five-number summary describes the center, spread, and rough shape of a distribution in just 5 values.



Statistic   What it is                         R function

Min         Smallest value                     min(x)
Q1          25th percentile (lower quartile)   quantile(x, 0.25)
Median      50th percentile (middle value)     median(x)
Q3          75th percentile (upper quartile)   quantile(x, 0.75)
Max         Largest value                      max(x)
IQR         Q3 − Q1 (middle 50% of data)       IQR(x)

x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)

# Five-number summary
fivenum(x)                  # Min Q1 Median Q3 Max
quantile(x)                 # 0% 25% 50% 75% 100%
quantile(x, c(0.1, 0.9))    # 10th and 90th percentile

# Full descriptive summary (usually the most useful first look)
summary(x)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  3.00    7.00   12.00   11.22   14.00   21.00

# For a data frame: summarizes ALL columns at once
summary(mtcars)

# psych package: even more detail
# install.packages("psych")
library(psych)
describe(mtcars)

■ summary() is your first stop for every new dataset. Run it before anything else.
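One detail worth knowing when comparing these functions: fivenum() and quantile() can report different quartiles for the same data, because fivenum() uses Tukey's hinges while quantile() interpolates (type = 7 by default). A short sketch with an even-length vector shows the difference.

```r
x <- c(4, 8, 15, 16, 23, 42)

fivenum(x)                    # 4.0  8.0 15.5 23.0 42.0 (Tukey's hinges)
quantile(x, c(0.25, 0.75))    # 9.75 and 21.25 (interpolated, type = 7)

# IQR() and summary() follow quantile(), not fivenum()
IQR(x)                            # 11.5
fivenum(x)[4] - fivenum(x)[2]     # 15, the spread between the hinges
```

Neither result is wrong; they are different conventions, so it helps to state which one a report uses.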



6. Why Visualize Before Analysis?

Summary statistics alone can be seriously misleading. Always plot your data first.

■ Anscombe's Quartet: four datasets share nearly identical means, variances, correlations and regression lines, yet look totally different when plotted. Only a plot reveals the difference.
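R ships the quartet as the built-in anscombe data frame (columns x1..x4 and y1..y4), so the claim can be checked directly. The following is a minimal sketch; the object name stats and its row labels are chosen here for illustration.

```r
data(anscombe)   # built-in data frame with columns x1..x4 and y1..y4

stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y),
    intercept = unname(fit[1]), slope = unname(fit[2]))
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)   # the four columns are nearly identical

# The plots are what tell the datasets apart
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
par(op)
```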

The EDA (Exploratory Data Analysis) workflow:

Step  Action                                   R code

1     Load data and check structure            str(df); head(df); dim(df)
2     Run summary statistics                   summary(df); describe(df)
3     Check missing values                     sum(is.na(df)); colSums(is.na(df))
4     Plot distributions (histogram/density)   hist(x); plot(density(x))
5     Check for outliers (boxplot)             boxplot(x)
6     Explore relationships (scatter/corr)     plot(x,y); cor(df)
7     Then proceed to modelling                lm(), glm(), t.test(), …



7. Histogram

Shows the distribution of a single continuous variable by dividing it into bins and counting frequencies.

# Base R
hist(mtcars$mpg)
hist(mtcars$mpg, breaks=15, col="steelblue", border="white",
     main="Distribution of MPG", xlab="Miles per Gallon")

# Density curve overlay (freq=FALSE puts the bars on the density scale)
hist(mtcars$mpg, freq=FALSE, col="lightblue")
lines(density(mtcars$mpg), col="darkblue", lwd=2)

# ggplot2
library(ggplot2)
# Bars use after_stat(density) so they share the y scale of geom_density()
ggplot(mtcars, aes(x=mpg)) +
  geom_histogram(aes(y=after_stat(density)), bins=12,
                 fill="steelblue", color="white") +
  geom_density(color="red", lwd=1) +
  labs(title="MPG Distribution") + theme_minimal()

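■ Use breaks= in base R or bins= in ggplot2 to control the number of bins. A single number passed to breaks= is only a suggestion, which hist() may adjust to get round bin edges. Too few bins hide detail; too many make the plot noisy.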
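When unsure how many bins to use, base R offers a few standard rules that give a starting point. The sketch below compares them on mtcars$mpg and inspects the resulting bins without drawing anything; the names mpg and h are illustrative.

```r
mpg <- mtcars$mpg

# Suggested number of bins under three common rules
nclass.Sturges(mpg)    # 6: Sturges' rule, the default used by hist()
nclass.scott(mpg)      # Scott's rule
nclass.FD(mpg)         # Freedman-Diaconis rule

# plot = FALSE returns the binning without drawing, useful for inspection
h <- hist(mpg, breaks = "FD", plot = FALSE)
h$breaks    # bin edges actually used
h$counts    # number of observations per bin
sum(h$counts) == length(mpg)    # TRUE: every value falls in exactly one bin
```

These rules are heuristics; it is still worth trying two or three bin counts and checking that the overall shape is stable.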

8. Boxplot (Box-and-Whisker Plot)

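Displays the five-number summary visually, with whiskers reaching to the most extreme points within 1.5 × IQR of the box and anything beyond drawn as individual points. Excellent for comparing groups and spotting outliers.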

# Base R boxplot
boxplot(mpg ~ cyl, data=mtcars,
        col=c("lightblue","lightgreen","salmon"),
        main="MPG by Cylinders", xlab="Cylinders", ylab="MPG")

# ggplot2 boxplot
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
  geom_boxplot(outlier.shape=NA) +        # outliers are drawn by the jitter layer
  geom_jitter(width=0.1, alpha=0.4) +     # show actual points
  labs(title="MPG by Cylinders") + theme_minimal()
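The numbers behind a boxplot can be retrieved with boxplot.stats(), which is what boxplot() uses internally. The data below are artificial, chosen so that one value lies clearly beyond the upper whisker.

```r
# Artificial data: 50 is far from the rest and should be flagged
x <- c(12, 14, 15, 15, 16, 17, 18, 19, 21, 50)

bs <- boxplot.stats(x)    # coef = 1.5 by default
bs$stats    # 12.0 15.0 16.5 19.0 21.0
            # (lower whisker, lower hinge, median, upper hinge, upper whisker)
bs$out      # 50: the point beyond the whiskers

# The upper whisker stops at the largest value inside the fence,
# so it is not the maximum of the data here
max(x)         # 50
bs$stats[5]    # 21
```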



9. Scatter Plot

Reveals the relationship between two continuous variables. Essential before computing correlation.

# Base R scatter
plot(mtcars$wt, mtcars$mpg,
     xlab="Weight", ylab="MPG",
     main="Weight vs MPG", pch=19, col="steelblue")
abline(lm(mpg~wt, data=mtcars), col="red", lwd=2)

# Correlation coefficient
cor(mtcars$wt, mtcars$mpg)       # -0.868
cor(mtcars, method="pearson")    # full correlation matrix

# Pairs plot: all variables at once
pairs(mtcars[,1:5], pch=19, col="steelblue")

# ggplot2
ggplot(mtcars, aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point(size=3, alpha=0.8) +
  geom_smooth(method="lm", se=TRUE) +
  labs(title="Weight vs MPG by Cylinders") + theme_minimal()
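To see what cor() actually computes, the Pearson coefficient can be rebuilt from the covariance and standard deviations, and compared with the rank-based Spearman version. This is a sketch; r_manual is an illustrative name.

```r
wt  <- mtcars$wt
mpg <- mtcars$mpg

# Pearson correlation = covariance scaled by both standard deviations
r_manual <- cov(wt, mpg) / (sd(wt) * sd(mpg))
all.equal(r_manual, cor(wt, mpg))    # TRUE

# Spearman is the Pearson correlation of the ranks, so it measures
# any monotonic relationship, not only a linear one
cor(wt, mpg, method = "spearman")
all.equal(cor(rank(wt), rank(mpg)),
          cor(wt, mpg, method = "spearman"))    # TRUE
```

If the scatter plot shows a curved but monotonic pattern, Spearman is usually the more appropriate summary.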



10. Bar Chart & Pie Chart

Bar charts show frequencies or counts of categorical variables. Pie charts show proportions.

# Base R: bar chart
counts <- table(mtcars$cyl)    # frequency table

barplot(counts,
        col=c("steelblue","salmon","lightgreen"),
        main="Cars by Cylinders", xlab="Cylinders", ylab="Count")

# Pie chart (use sparingly; bar charts are usually better)
pie(counts, labels=paste(names(counts),"cyl"), col=rainbow(3))

# ggplot2 bar chart
ggplot(mtcars, aes(x=factor(cyl), fill=factor(cyl))) +
  geom_bar() +
  labs(title="Cars by Cylinders", x="Cylinders") + theme_minimal()

■ Prefer bar charts over pie charts — humans are better at comparing lengths than angles.
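If proportions are the goal, a bar chart of percentages can stand in for the pie chart. The sketch below uses prop.table() on the same cylinder counts; the name props is illustrative.

```r
counts <- table(mtcars$cyl)
counts    # 11 cars with 4 cylinders, 7 with 6, 14 with 8

props <- prop.table(counts)    # each count divided by the total
round(100 * props, 1)          # roughly 34%, 22% and 44%

barplot(100 * props,
        main = "Cars by Cylinders (%)",
        xlab = "Cylinders", ylab = "Percent of cars")
```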

11. ggplot2 — Grammar of Graphics

ggplot2 is one of R's most widely used visualization packages, built on a layered grammar of graphics. Every plot is built from the same components.

Layer        Purpose                              Examples

Data         The dataset to use                   ggplot(data=mtcars, ...)
Aesthetics   Map variables to visual properties   aes(x=wt, y=mpg, color=cyl)
Geoms        The geometric shape to draw          geom_point(), geom_line(), geom_bar()
Stats        Statistical transformations          stat_smooth(), stat_count()
Scales       Control axes, colors, sizes          scale_color_manual(), scale_x_log10()
Facets       Split into subplots                  facet_wrap(~cyl), facet_grid(am~cyl)
Theme        Non-data appearance                  theme_minimal(), theme_bw()



library(ggplot2)

# Template: every ggplot follows this pattern:
# ggplot(data, aes(x, y)) + geom_*() + labs() + theme_*()

# Histogram
ggplot(mtcars, aes(x=mpg)) + geom_histogram(bins=10, fill="steelblue")

# Boxplot by group
ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) +
  geom_boxplot() + theme_minimal() + labs(x="Cylinders")

# Scatter with regression line
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point(aes(color=factor(cyl)), size=3) +
  geom_smooth(method="lm", se=TRUE, color="black") +
  theme_minimal()

# Faceted plot: one panel per cylinder group
ggplot(mtcars, aes(x=wt, y=mpg)) +
  geom_point() + facet_wrap(~cyl) + theme_minimal()



12. Full EDA Workflow Example (mtcars dataset)

# -- STEP 1: Load & inspect ------------------------------
data(mtcars)
dim(mtcars)        # 32 rows, 11 columns
str(mtcars)        # variable types
head(mtcars, 5)    # first 5 rows

# -- STEP 2: Missing values ------------------------------
sum(is.na(mtcars))        # 0: no missing values
colSums(is.na(mtcars))    # per-column count

# -- STEP 3: Summary statistics --------------------------
summary(mtcars)

library(psych)
describe(mtcars)    # adds skewness, kurtosis, SE

# -- STEP 4: Distribution of key variable ----------------
# freq=FALSE puts the bars on the density scale so the curve is visible
hist(mtcars$mpg, breaks=12, freq=FALSE, col="steelblue", border="white",
     main="MPG Distribution", xlab="Miles per Gallon")
lines(density(mtcars$mpg), col="red", lwd=2)

library(e1071)           # provides skewness() and kurtosis()
skewness(mtcars$mpg)     # right-skewed (about 0.6, moderate by the table in Section 4)
kurtosis(mtcars$mpg)     # platykurtic (negative excess kurtosis)

# -- STEP 5: Outlier detection ---------------------------
boxplot(mtcars$mpg, main="MPG Boxplot", col="lightblue")

# IQR method
Q1 <- quantile(mtcars$mpg, 0.25)
Q3 <- quantile(mtcars$mpg, 0.75)
IQR_val <- IQR(mtcars$mpg)
outliers <- mtcars$mpg[mtcars$mpg < Q1 - 1.5*IQR_val |
                       mtcars$mpg > Q3 + 1.5*IQR_val]

# -- STEP 6: Relationships -------------------------------
cor(mtcars[, c("mpg","wt","hp","disp")])    # correlation matrix

pairs(mtcars[,1:5], pch=19,
      col=ifelse(mtcars$cyl==4,"blue",
          ifelse(mtcars$cyl==6,"green","red")))

# -- STEP 7: Grouped comparisons -------------------------
tapply(mtcars$mpg, mtcars$cyl, summary)

boxplot(mpg ~ cyl, data=mtcars, col=rainbow(3),
        main="MPG by Cylinders")

# -- STEP 8: Now proceed to modelling --------------------
model <- lm(mpg ~ wt + hp + cyl, data=mtcars)
summary(model)

■ Always complete Steps 1–7 before Step 8. Skipping EDA can lead to misspecified models and missed patterns.
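The IQR rule from Step 5 can be wrapped in a function and applied to every column at once. This is a sketch under the same 1.5 × IQR convention; iqr_outliers() is a hypothetical helper name, not part of base R or any package.

```r
# iqr_outliers() is a hypothetical helper name used only for this sketch
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  # Return the values lying outside [Q1 - fence, Q3 + fence]
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Number of flagged values in each column of mtcars
sapply(mtcars, function(col) length(iqr_outliers(col)))

iqr_outliers(mtcars$hp)    # 335 (the Maserati Bora)
```

A flagged value is a prompt to look closer, not an instruction to delete it; the 335 hp car is a genuine observation.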

Quick Reference — Plot Type Chooser

Data / Goal                       Best Plot               R function

One continuous variable           Histogram + density     hist() / geom_histogram()
One categorical variable          Bar chart               barplot() / geom_bar()
Two continuous variables          Scatter plot            plot() / geom_point()
One continuous, one categorical   Boxplot / violin plot   boxplot() / geom_boxplot()
Trend over time                   Line chart              plot(type="l") / geom_line()
Proportions of categories         Pie / stacked bar       pie() / geom_bar(position="fill")
Many variables at once            Pairs / heatmap         pairs() / corrplot() (corrplot package)
Distribution shape                Q-Q plot                qqnorm(); qqline()
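The Q-Q plot in the last row is the only plot type in this table without an example earlier in the notes, so here is a minimal sketch using mtcars$mpg. Points close to the reference line suggest the variable is roughly normal; the name qq is illustrative.

```r
mpg <- mtcars$mpg

# Sample quantiles against theoretical normal quantiles
qqnorm(mpg, main = "Normal Q-Q Plot of MPG", pch = 19)
qqline(mpg, col = "red", lwd = 2)    # reference line through the quartiles

# plot.it = FALSE returns the plotted coordinates instead of drawing
qq <- qqnorm(mpg, plot.it = FALSE)
head(cbind(theoretical = qq$x, sample = qq$y))
```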
