WEEK 2 LAB EXERCISE – Exploring Built-in Datasets
Duration: 2 hours
Mode: Guided and Hands-on Practice
Module: CT127-3-2 Programming for Data Analysis
Lecturer: Dr. Kulothunkan Palasundram (Dr. Kulo)
Objective
By the end of this lab, students will be able to:
1. Write basic R commands to perform simple exploratory data analysis on two
built-in datasets: iris and mtcars
Dataset: iris
Part A — Explore iris
Load and inspect Fisher's famous iris dataset by running the commands below
one at a time.
data(iris) # Loads the built-in iris dataset
dim(iris) # Number of rows and columns
names(iris) # Column names
str(iris) # Structure: the type of each column
head(iris, 3) # First 3 rows
summary(iris) # Summary stats for numeric; counts for factor
table(iris$Species) # Frequency table for the Species
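The commands above print their results to the console, but the same information can be stored in variables and reused. A minimal sketch; the variable names n_rows, n_cols, col_types and species_counts are illustrative and not part of the lab:

```r
data(iris)

# dim() returns c(rows, columns); nrow() and ncol() return each part separately
n_rows <- nrow(iris)
n_cols <- ncol(iris)

# Class of every column: four numeric measurements and one factor
col_types <- sapply(iris, class)

# table() returns one named count per species
species_counts <- table(iris$Species)

print(c(rows = n_rows, cols = n_cols))
print(col_types)
print(species_counts)
```

Storing the results this way makes it easy to use them in later calculations instead of reading them off the console.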
Part B — Perform basic manipulations - Select, Filter, Sort
Create subsets of the data frame and order rows by a variable.
# Select two columns into a new object sl
sl <- iris[, c("Sepal.Length", "Species")]
head(sl)
# Filter: keep setosa rows with a sepal length above 5
setosa_big <- subset(iris, Species == "setosa" & Sepal.Length > 5)
nrow(setosa_big)
# Reorder rows from largest to smallest sepal length
sorted <- iris[order(iris$Sepal.Length, decreasing = TRUE), ]
# Show top 5 rows and first two columns
head(sorted, 5)[, 1:2]
Notes:
• subset() uses a logical condition to filter rows.
• order() returns the row indices that sort a vector; use them inside [ ] to
reorder the data frame.
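Both notes can be checked directly. The sketch below assumes the filter and sort column is Sepal.Length; the names keep, setosa_big2 and idx are illustrative only:

```r
data(iris)

# subset() keeps the rows where the logical condition is TRUE
setosa_big <- subset(iris, Species == "setosa" & Sepal.Length > 5)

# The same filter written with a logical vector inside [ ]
keep <- iris$Species == "setosa" & iris$Sepal.Length > 5
setosa_big2 <- iris[keep, ]

# Both approaches select the same rows
print(identical(rownames(setosa_big), rownames(setosa_big2)))

# order() returns row positions, not the sorted values themselves
idx <- order(iris$Sepal.Length, decreasing = TRUE)
print(head(idx, 3))                     # positions of the three longest sepals
print(head(iris$Sepal.Length[idx], 3))  # the values found at those positions
```

Inside subset() the column names can be written bare, whereas inside [ ] each column needs the iris$ prefix.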
Part C — New Variables & Grouping
Create a ratio variable and bin a numeric variable into categories; compute grouped
means.
# New numeric column
iris$[Link] <- iris$[Link] / iris$[Link]
summary(iris$[Link])
# Bin sepal length into three categories
iris$SepalLenCat <- cut(
  iris$Sepal.Length,
  breaks = c(-Inf, 5.5, 6.5, Inf),
  labels = c("short", "medium", "long")
)
table(iris$SepalLenCat)
# Mean [Link] per Species
tapply(iris$[Link], iris$Species, mean)
Notes:
• cut() converts a continuous variable into categorical bins (factor).
• tapply(x, g, f) applies f to x within each group g.
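The behaviour described in these notes can be seen in a small self-contained run. This is a sketch under stated assumptions: the binned column is taken to be Sepal.Length (consistent with the name SepalLenCat), and the ratio is taken to be Petal.Length divided by Petal.Width, stored in a hypothetical column named Petal.Ratio.

```r
data(iris)

brks <- c(-Inf, 5.5, 6.5, Inf)
labs <- c("short", "medium", "long")

# cut() closes each interval on the right by default:
# (-Inf, 5.5], (5.5, 6.5], (6.5, Inf]
iris$SepalLenCat <- cut(iris$Sepal.Length, breaks = brks, labels = labs)

# A value exactly on a break therefore falls into the lower bin
on_break <- cut(5.5, breaks = brks, labels = labs)
print(on_break)

# Hypothetical ratio column (the name Petal.Ratio is an assumption)
iris$Petal.Ratio <- iris$Petal.Length / iris$Petal.Width

# tapply(x, g, f): split x by the groups in g, then apply f to each piece
ratio_means <- tapply(iris$Petal.Ratio, iris$Species, mean)

# The same value computed by hand for a single group
setosa_mean <- mean(iris$Petal.Ratio[iris$Species == "setosa"])

print(ratio_means)
print(setosa_mean)
```

The hand-computed setosa mean matches the first element of the tapply() result, which shows that tapply() does nothing more than repeat that calculation for each group.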
Part D — Quick Visuals (base R)
Produce a histogram, a grouped boxplot, and a scatterplot.
hist(iris$Sepal.Length, # Numeric vector for histogram
main = "Histogram of Sepal Length", # Title
xlab = "Sepal.Length") # X-axis label
boxplot(Sepal.Length ~ Species, data = iris, # Formula: y ~ group
main = "Sepal Length by Species", # Title
ylab = "Sepal.Length") # Y-axis label
plot(iris$[Link], iris$[Link], # Scatter: x then y
xlab = "[Link]", ylab = "[Link]",
pch = 19) # Solid points
Notes:
• A histogram shows the distribution of one variable; a boxplot compares
distributions across groups; a scatterplot shows the relationship between two
variables.
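Each of these plots is drawn from numbers that can also be inspected directly. A minimal sketch, assuming Sepal.Length for the histogram and boxplot; Petal.Length and Petal.Width are chosen for the correlation purely for illustration:

```r
data(iris)

# hist() with plot = FALSE returns the bins and counts without drawing
h <- hist(iris$Sepal.Length, plot = FALSE)
print(h$breaks)  # bin edges
print(h$counts)  # number of observations in each bin

# boxplot() with plot = FALSE returns the box and whisker statistics per group
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
print(b$names)   # group labels
print(b$stats)   # rows: lower whisker, lower hinge, median, upper hinge, upper whisker

# cor() puts a single number on the relationship a scatterplot shows
r <- cor(iris$Petal.Length, iris$Petal.Width)
print(r)
```

Looking at these values next to the plots helps connect each visual feature (bar height, box edges, point cloud direction) to the quantity it represents.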
Dataset: mtcars
Part A — Explore mtcars
# Load the dataset
data(mtcars)
1. Write an R command that returns the number of rows and columns in the dataset
2. List the column names of the mtcars dataset
3. Identify the data type of every column
Part B — Checking the data distribution
4. Write R commands to compute the mean (average) and median (middle value) of a
column of your choice
Part C — Visuals
5. Plot a histogram for mpg. Explain what you see.
6. Plot a grouped boxplot. What is a boxplot used for?
7. Plot a scatterplot of hp and mpg. Explain the relationship between the two
variables.
Part F — Data Wrangling (Base R: New Variables, Cross-Tabs, Reshape)
1. Create a new variable called mpg_band
The new variable takes the values defined below
mpg        mpg_band
< 18       low
18 to 25   medium
> 25       high
2. Write a command to count how many cars fall in each band
3. Write a command to split the cars by weight (wt). Any car weighing more than
the median should be categorized as heavy, otherwise light. How many cars are
there in each category?
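The banding tasks above rely on two base R tools, cut() and ifelse(). As a hint rather than a solution, the sketch below applies them to a made-up vector with made-up thresholds; the names scores, band and half and all the values are hypothetical and do not come from mtcars.

```r
# Made-up data for illustration only
scores <- c(12, 40, 55, 70, 70, 91)

# Three bands: cut() closes each interval on the right by default,
# so 40 lands in "low" and 70 lands in "mid"
band <- cut(
  scores,
  breaks = c(-Inf, 40, 70, Inf),
  labels = c("low", "mid", "high")
)
print(table(band))

# Two groups split at the median (62.5 for this vector)
half <- ifelse(scores > median(scores), "above", "below")
print(table(half))
```

Whether a value that sits exactly on a boundary goes to the lower or the upper band is controlled by the right argument of cut(), so check the band definitions carefully before choosing the breaks.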