
Main R Cheatsheet

This R Cheatsheet provides a comprehensive overview of basic syntax, data types, structures, and operations in R, including vectors, matrices, factors, and data frames. It also covers exploratory data analysis (EDA) techniques such as frequency tables, histograms, and boxplots, along with methods for identifying outliers and skewness in data. The document serves as a quick reference guide for R programming and data analysis.

Uploaded by

coool coool

R Cheatsheet

Basic Syntax
Arithmetic Operations: +, -, /, *, ^, %%

Assignments: x <- 5 , x=5

Comments: # This is a comment

Print: print(x)

Data Types
Decimal values like 4.5 are called numerics

Whole numbers like 4L are called integers (a plain 4 is stored as a numeric unless it has the L suffix). Integers are also numerics

Boolean values ( TRUE or FALSE ) are called logical

Text (or string) values ( "Hello" ) are called characters

Structures
Vector

# Creating a vector + shortcut
vec = c(1, 2, 3) # List the values explicitly
vec = 1:10 # Shortcut for consecutive integers 1, 2, ..., n i.e. c(1:n)

# Repeat
rep(1:2, times=3)

# Sequence
seq(1, 10, by=2)

# Name elements
names(vec) <- c(...)

# Access -> 1-indexed NOT 0-indexed
vec[1]

# Multiple access
vec[c(1,3)] # using index
named_vec[c("A", "B")] # using names

# Getting desired elements


# filter => vec[logical vector] to select wanted elements
# This does element-wise comparison i.e. all elements compared
vec[vec > 5]

# Vector of 0s of length = n
numeric(n)

# Element-wise operations
vec + 5
vec * 5

# Some vec functions
sum(x); mean(x); max(x); min(x); range(x)
cor(x, y) # This means correlation
sort(x) # Can be used for finding median by sort(x)[middle_index]

# Excluding elements
v[-1] # Exclude 1st element
v[-c(2, 4)] # Exclude 2nd and 4th elements
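
The vector operations above can be tried on a small named vector. This is a minimal sketch with made-up marks; the names `marks`, `above_5` and `manual_median` are only for illustration.

```r
# Hypothetical marks for five students (made-up data)
marks <- c(A = 4, B = 9, C = 6, D = 2, E = 8)

above_5 <- marks[marks > 5]   # logical filter keeps B, C and E
without_first <- marks[-1]    # negative index drops element A

# Median by hand: sort, then take the middle element (works for odd length)
sorted <- sort(marks)
middle_index <- (length(sorted) + 1) / 2
manual_median <- unname(sorted[middle_index])

print(above_5)
print(manual_median == median(marks))
```

For an even number of elements the median is the average of the two middle values, so `median()` is the safer choice in general.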

Matrices
Mainly → matrix(vector_values, nrow = m, ncol = n)

Creates mxn matrix

# byrow = T
matrix(1:9, byrow = TRUE, nrow = 3)
>      [,1] [,2] [,3]
  [1,]    1    2    3
  [2,]    4    5    6
  [3,]    7    8    9

# byrow = F
matrix(1:9, byrow = FALSE, nrow = 3)
>      [,1] [,2] [,3]
  [1,]    1    4    7
  [2,]    2    5    8
  [3,]    3    6    9

# Using dimnames to name matrix rows and cols
matrix(x, nrow = 3, byrow = TRUE, dimnames = list(titles, region)) # list(row names, col names)

# rbind i.e. row bind => attach a new row


rbind(x, y)

# cbind i.e. col bind => attach a new col


cbind(x, y)

# Sums
rowSums(x) # Sums across rows
colSums(x) # Sums across cols

# Means
rowMeans(x)
colMeans(x)

# Naming
rownames(x) = c(...)
colnames(x) = c(...)

# Indexing (1 - Indexed)
x[c(a,b), c(c,d)] # Gets rows a & b of cols c & d
x[, col_selection] # All rows and specific cols

x[row_selection, ] # All cols and specific rows
# where row/col_selection can be names i.e. c("sel1", "sel2")

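# Transpose, Element-wise Reciprocal, Inverse, Matrix Multiplication
t(x) # Transpose
1/x # Element-wise reciprocal (NOT the matrix inverse)
solve(x) # Matrix inverse (x must be square and invertible)
x %*% y # Matrix multiplication (%% is modulo, not multiplication)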

# Element-wise operations
x+5
x*5

# Excluding elements
x[-1, ] # Include everything except 1st row
x[, -c(2, 4)] # Include everything except 2nd and 4th cols
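
It is easy to mix up element-wise and matrix operations, so here is a minimal sketch on a made-up 2x2 matrix `A` (an assumption for illustration) showing how `solve()`, `%*%`, `*` and `1/x` differ.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # 2x2 matrix, filled column by column

A_inv <- solve(A)          # true matrix inverse
product <- A %*% A_inv     # matrix product, should be the identity matrix
elementwise <- A * A       # squares each entry, NOT a matrix product
reciprocal <- 1 / A        # reciprocal of each entry, NOT the inverse

print(round(product, 10))
print(reciprocal)
```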

Factors
Used for nominal categorical variable (NO natural ordering) and an ordinal
categorical variable (HAS a natural ordering)

I mainly use it for renaming cols in a dataframe (covered in dataframes)

# Nominal Categorical
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

> [1] Elephant Giraffe Donkey Horse
Levels: Donkey Elephant Giraffe Horse

# Ordinal Categorical
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
factor_temperature_vector

> [1] High Low High Low Medium
Levels: Low < Medium < High

# FOR ORDINAL CATEGORICAL
some_factor[2] # Selects the 2nd factor value
some_factor[3] # Selects the 3rd factor value
some_factor[2] > some_factor[3] # Checks if one ORDERED factor value is more than the other

# "Renaming" Factors => This will be important for dataframes


levels(factor_vector) <- c("name1", "name2",...)

# Creating Factors
factor(x) # x can be a vector, a matrix (flattened to a vector) or a dataframe column i.e. dataframe$col
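
A minimal sketch of comparing and renaming an ordered factor. The vector `sizes` and the short labels are made up for illustration. Note that `levels<-` renames in level order (Low, Medium, High), not in the order the values appear.

```r
sizes <- factor(c("High", "Low", "Medium"),
                ordered = TRUE,
                levels = c("Low", "Medium", "High"))

is_higher <- sizes[1] > sizes[2]      # High > Low is allowed for ordered factors

levels(sizes) <- c("L", "M", "H")     # rename Low -> L, Medium -> M, High -> H
renamed <- as.character(sizes)

print(is_higher)
print(renamed)
```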

DataFrame
Object with rows and columns

Each row is a subject

Each col is a variable

# Creating Dataframes connected col by col
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Accessing column and rows


df$col
df[, "col"]
df[1, 2]

# Generating subset based on a columns value


subset(df, a > 1) # Returns rows satisfying the column constraint

# Viewing df
head(df) # First 6 rows
tail(df) # Last 6 rows
str(df) # Structure: col types and a preview of values
names(df) # Col names

# Renaming df

names(df) = c("new_name1", "new_name2")
names(df)[names(df) == "col"] = "name" # Rename particular column / can use c

# Renaming column values


df$col = factor(df$col, levels = c(0, 1), labels = c("no", "yes"))
df$col = ifelse(df$col == 0, "no", "yes")

# Filter / Logical df
logical_df = df == "some_value" # Element-wise comparison
# Mainly useful if want to calculate proport
# col against a denominator e.g. tut 6
desired_df = df[df$col == "some_value", ] # Get all columns for rows with the specified value

subset(df, col == "some_value") # Can apply any logical operator

library(dplyr) # Needed for %>% and filter()
df %>%
  filter(col == "some_value") # dplyr framework to do the same as above

# Adding new column


df$new_col = ifelse(df$other_col == "some_value", a, b) # Adding new col based
# logical test o
# can be prim
df = df %>%
  mutate(new_col = case_when(
    other_col == value_1 ~ a,
    other_col == value_2 ~ b
  ))

# Indexing df
df[1, 2] # Row 1, Column 2
df[ , 1] # All rows, column 1
df[1:2, ] # Rows 1 and 2, all columns

# Excluding elements
df[-1, ] # Exclude 1st row
df[ , -2] # Exclude 2nd column

df[-c(1,3), -1] # Exclude rows 1 & 3, and column 1

# Attach (NOT RECOMMENDED)
attach(df) # Allows col names to be called as is without $ operator
df[col == "some_value", ] # Rows where col equals the value

# Sums
rowSums(df) # Sums across rows
colSums(df) # Sums across cols

# Means
rowMeans(df)
colMeans(df)
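
The dataframe steps above can be strung together on a tiny made-up table. This is a sketch in base R only; the column names `id`, `smoker` and `bmi` are assumptions for illustration. Note that `colMeans()` and `rowSums()` need numeric columns, so non-numeric ones are dropped first.

```r
df <- data.frame(id = 1:4,
                 smoker = c(0, 1, 1, 0),
                 bmi = c(22.5, 31.0, 27.4, 19.8))

# Relabel coded values, then add a column from a logical test
df$smoker <- factor(df$smoker, levels = c(0, 1), labels = c("no", "yes"))
df$bmi_group <- ifelse(df$bmi >= 25, "high", "normal")

# Two equivalent ways to keep the rows where smoker is "yes"
smokers <- df[df$smoker == "yes", ]
same_rows <- subset(df, smoker == "yes")

# Column means on the numeric columns only
numeric_means <- colMeans(df[, c("id", "bmi")])

print(smokers)
print(numeric_means)
```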

EDA
Frequency Table & Barplot

# Frequency Table
wanted_freq = table(df$wanted_col)

# Proportion Table
prop.table(wanted_freq)

# Percentage Table
prop.table(wanted_freq) * 100

# Bar Graph
barplot(wanted_freq, main = "Bar Plot Title",
        xlab = "Responses", ylab = "Frequency", col = "some_colour")
### Some important params are xlim = c(x1, x2), ylim = c(y1, y2)

Done on categorical variables

Summarise using:

Modal category

Proportion or percentage for modal category
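
A minimal sketch of summarising one categorical variable by its modal category and percentage. The `responses` vector is made up for illustration.

```r
responses <- c("yes", "no", "yes", "yes", "maybe", "no", "yes")

freq <- table(responses)          # frequency table
pct <- prop.table(freq) * 100     # percentage table

modal_category <- names(freq)[which.max(freq)]  # most frequent category
modal_pct <- max(pct)                           # its percentage

cat("Modal category:", modal_category, "with", round(modal_pct, 1), "%\n")
```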

Histogram & Boxplots


Presenting Histogram

# Read the CSV file (Assuming it's named '[Link]' and has a column named 'BMI')
data <- read.csv("[Link]")

# 1. Get numerical summaries
summary(data$BMI) # Provides min, 1st quartile, median, mean, 3rd quartile, max
mean(data$BMI, na.rm = TRUE) # Mean
median(data$BMI, na.rm = TRUE) # Median
sd(data$BMI, na.rm = TRUE) # Standard deviation

# 2. Form a histogram by frequency
hist(data$BMI, main = "Histogram of BMI (Frequency)", xlab = "BMI", col = "lightblue")

# Form a histogram by probability (normalized)


hist(data$BMI, probability = TRUE, main = "Histogram of BMI (Probability)",
xlab = "BMI", col = "lightgreen")

# 3. Form a boxplot and check for outliers


boxplot(data$BMI, main = "Boxplot of BMI", ylab = "BMI", col = "lightcoral")

# Identify outliers manually
Q1 <- quantile(data$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data$BMI, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Alternatively,*****
boxplot.stats(data$BMI) # Outliers are in the $out component

outliers <- data$BMI[data$BMI < lower_bound | data$BMI > upper_bound]
print("Outliers:")
print(outliers)

# 4. Comment on symmetry
skewness <- mean(data$BMI, na.rm = TRUE) - median(data$BMI, na.rm = TRUE)
if (skewness > 0) {
print("The BMI distribution is right-skewed (positively skewed).")
} else if (skewness < 0) {
print("The BMI distribution is left-skewed (negatively skewed).")
} else {
print("The BMI distribution is symmetric.")
}

##### EXTRA #####


# data is some quantitative variable
hist(data, breaks = 20, prob = TRUE, main = "Some Title") # prob = FALSE for frequencies

# Specifying breaks width / interval width
break_seq = seq(0, 200, by = 5) # 0 5 10 ... 195 200
hist(data, breaks = break_seq, prob = TRUE, main = "Some Title")

# Specifying specific data, mark is some quantitative variable vector
# Or is called after attach(data)
hist(mark[mark > 30], breaks = 20, prob = TRUE, main = "Some Title")

# Range
range(data) # Will include outliers

# Standard Deviation and Variance


sd(data)
var(data)

# IQR

IQR(data)

# Outliers
outliers = df$col[df$col < lower_bound | df$col > upper_bound] # Or,

outliers = data %>%
  filter(value < lower_bound | value > upper_bound) %>%
  pull(value) # Or,

outliers = boxplot.stats(data)$out #### BEST AND EASIEST ####

outliers_index = which(data$col %in% outliers)

# Remove outliers (only if outliers_index is not empty, else x[-integer(0)] drops everything)
new_vec_no_outliers = df$col[-outliers_index] # col without outliers
new_df_no_outliers = df[-outliers_index, ] # dataframe without outliers
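
A worked sketch of the 1.5 * IQR rule on made-up numbers (the vector `x` is an assumption for illustration). Filtering with a logical vector instead of a negative index avoids the empty-index problem when there are no outliers. Keep in mind that `boxplot.stats()` uses hinges rather than `quantile()`, so the two methods can disagree slightly on borderline points.

```r
x <- c(18, 20, 21, 22, 22, 23, 24, 25, 26, 45)  # made-up values, 45 looks extreme

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr_value <- q3 - q1
lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

is_outlier <- x < lower_bound | x > upper_bound
outliers <- x[is_outlier]     # values flagged by the rule
x_clean <- x[!is_outlier]     # safe even when nothing is flagged

print(outliers)
print(boxplot.stats(x)$out)   # same point flagged for this data
```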

How to summarise histograms?

Overall Pattern:

Look for clusters or gaps

Identify any potential outliers (observations that deviate from the rest)

Unimodal vs. Multimodal:

Unimodal: Single peak or mound

Bell-Shape ⇒ Unimodal & Symmetric

Bimodal/Multimodal: Two or more peaks (modes)

Symmetry vs. Skewness:

Symmetric Distribution: Both sides mirror each other

Skewed Distribution: One tail is longer than the other

Presenting Boxplot

# Load data from CSV (assuming column named 'BMI')
data <- read.csv("[Link]")

# 1. Get Numerical Summaries (5-number summary + mean, sd)


summary(data$BMI) # Provides min, 1st quartile, median, mean, 3rd quartile, max
mean(data$BMI, na.rm = TRUE) # Mean of BMI
median(data$BMI, na.rm = TRUE) # Median of BMI
sd(data$BMI, na.rm = TRUE) # Standard deviation of BMI

# 2. Create a histogram for visualization


hist(data$BMI, main = "Histogram of BMI (Frequency)", xlab = "BMI", col = "lightblue")

# 3. Create a Boxplot to visualize the distribution of BMI


boxplot(data$BMI, main = "Boxplot of BMI", ylab = "BMI", col = "lightcoral")

# 4. Identify Outliers using IQR method


Q1 <- quantile(data$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data$BMI, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1 # Interquartile range

# Calculate lower and upper bounds for the whiskers


lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Alternative: Identify outliers using the boxplot.stats function
outliers <- boxplot.stats(data$BMI)$out

# Print identified outliers


cat("Outliers identified: \n")
print(outliers)

# 5. Identify the Min and Max of the data that are **not outliers**
non_outliers <- data$BMI[data$BMI >= lower_bound & data$BMI <= upper_bound]
cat("Min and Max of non-outlier data: \n")
cat("Min:", min(non_outliers, na.rm = TRUE), "\n")
cat("Max:", max(non_outliers, na.rm = TRUE), "\n")

# 6. Comment on Symmetry (Skewness)


skewness <- mean(data$BMI, na.rm = TRUE) - median(data$BMI, na.rm = TRUE)
if (skewness > 0) {
cat("The BMI distribution is right-skewed (positively skewed).\n")
} else if (skewness < 0) {
cat("The BMI distribution is left-skewed (negatively skewed).\n")
} else {
cat("The BMI distribution is symmetric.\n")
}

Outliers and Skewness


Skewness: A distribution is skewed when one tail is longer than the other

Left Skewed (Negative Skew): Left tail is longer; data pulls to the left

Mean < Median

Right Skewed (Positive Skew): Right tail is longer; data pulls to the right

Mean > Median

Outliers: Skewness can indicate the presence of outliers, typically in the longer tail

High Skew ? ⇒ Report Median & IQR

Low Skew / ~ Symmetric ? ⇒ Report Mean & Var / Spread

Median & IQR are more robust to outliers than Mean & Variance
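
A small simulation sketch of the two points above: mean > median under right skew, and the median barely moving when an extreme value is added. The exponential sample and the added value 1000 are assumptions chosen for illustration.

```r
set.seed(1)
right_skewed <- rexp(1000, rate = 1)   # simulated right-skewed data

# Right skew: the long right tail pulls the mean above the median
mean_above_median <- mean(right_skewed) > median(right_skewed)

# Add one extreme value and compare how far each summary moves
with_outlier <- c(right_skewed, 1000)
mean_shift <- abs(mean(with_outlier) - mean(right_skewed))
median_shift <- abs(median(with_outlier) - median(right_skewed))

print(mean_above_median)
print(c(mean_shift = mean_shift, median_shift = median_shift))
```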

Association Between 2 Variables


Contingency Table (2 Categorical Variables)

bc = read.csv("breast_cancer.csv")
# bbd: cancer status, 0 = no; 1 = yes
# pmh: post-menopausal hormone usage: 2 = no; 3 = yes

# alcohol: amount of alcohol consumed
# hgt: height of participant
# agemenop: age of participant at menopause
# bmi: body mass index of participant

##### TWO CATEGORICAL VARIABLES #####

# Creating Contingency Table


table(bc$bbd, bc$pmh) # Usually exp variable then resp variable

#create new variables for bbd and pmh:


cancer <- ifelse(bc$bbd=="0","Absent","Present") # saved as vector

[Link] <- ifelse(bc$pmh == "2", "No", "Yes") # saved as vector

# replace the original values of bbd by new labels:


# bbd <- ifelse(bbd=="0","Absent","Present")

tab = table(cancer, [Link]) #[Link] in column


tab
proptab = prop.table(tab)*100 # joint probabilities
proptab

tab = table([Link], cancer) # cancer in column

proptab = prop.table(tab)*100 # joint probabilities


proptab

### CONDITIONAL PROBABILITIES:

# Taking [Link] as the denominator***


# create conditional percentage (condition on [Link])
proptab = prop.table(tab, "[Link]")*100
proptab # total probabilities in EACH ROW (pmh use) is 100%

#or

# Taking cancer as the denominator***
# create conditional percentage (condition on cancer)
proptab = prop.table(tab, "cancer")*100
proptab # total probabilities in EACH COLUMN (cancer) is 100%
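
To see the difference between joint and conditional percentages, here is a sketch on a tiny made-up table. The vectors `hormone` and `cancer_status` are hypothetical stand-ins, and the numeric margins 1 (rows) and 2 (columns) are used instead of dimension names.

```r
# Made-up data for six participants
cancer_status <- c("Absent", "Absent", "Present", "Absent", "Present", "Absent")
hormone <- c("No", "No", "No", "Yes", "Yes", "Yes")

tab <- table(hormone, cancer_status)   # hormone in rows, cancer in columns

joint <- prop.table(tab) * 100         # all cells together sum to 100
by_row <- prop.table(tab, 1) * 100     # each ROW sums to 100 (condition on hormone)
by_col <- prop.table(tab, 2) * 100     # each COLUMN sums to 100 (condition on cancer)

print(by_row)
print(by_col)
```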

##### CLUSTERED BAR PLOT (has 'beside = TRUE') #####


# it plots the clusters = columns
# we need the conditional probability on [Link]
tab = table(cancer,[Link]) #[Link] in column
tab
proptab = prop.table(tab, "[Link]")*100
proptab

# 2 clusters are formed by the column's categories, this is by default.


barplot(proptab, beside = TRUE)

barplot(proptab, beside = TRUE, xlab = "PMH Usage", main="",
        col=c("darkblue","red"),legend = rownames(proptab), ylim = c(0, 70) )

tab = table([Link], cancer) # cancer in column
proptab = prop.table(tab, "[Link]")*100
barplot(proptab, beside = TRUE, xlab = "Cancer", main="",
        col=c("darkblue","red"),legend = rownames(proptab), ylim = c(0, 70) )

##### STACKED BAR PLOT #####


tab = table(cancer, [Link]) #pmh in column
proptab = prop.table(tab, "[Link]")*100 ; proptab
barplot(proptab, xlab = "PMH Usage", main="",col=c("darkblue","red"),
        legend = rownames(proptab))

# each bar is 100% for each group of PMH.

####### If cancer is in column then:

tab = table([Link], cancer)


tab

proptab = prop.table(tab, "[Link]")*100


proptab

barplot(proptab, xlab = "Cancer", main="",col=c(2,5),legend = rownames(proptab))


#not recommended to interpret percentages when response var is in column

Boxplot (1 Categorical Variable & 1 Quantitative Variable)

#bc$agemenop~cancer => This means you put the "agemenop" as the quantitative variable
#and "cancer" as the categorical variable i.e, y ~ x
boxplot(bc$agemenop~cancer, col = c(5,5), ylab = "Age at Menopause",
        xlab = "Cancer Status")

# Alternatively,
boxplot(agemenop ~ cancer, data = bc, col = c(5, 5), ylab = "Age at Menopause",
        xlab = "Cancer Status")

boxplot(bc$agemenop~cancer)$out # values of all outliers for both boxplots

# identify the group of the outliers (1 = Absent or 2 = Present)


grp = boxplot(bc$agemenop~cancer)$group ; grp

# Tells you which of the outliers belong to values == 1
which(grp ==1) # index of outlier grp == 1 (Absent)
boxplot(bc$agemenop~cancer)$out[which(grp ==1)]# values of outliers in grp == 1

# Tells you which of the outliers belong to values == 2
which(grp ==2) # index of outlier in grp == 2 (Present)
boxplot(bc$agemenop~cancer)$out[which(grp ==2)]# values of outliers in grp == 2

boxplot(bc$agemenop~cancer)$names # group labels in plotting order

### Boxplot is most suitable to compare 1 quantitative and 1 categorical var


### To do this with histogram, we need to overlap them which is very difficult to

Scatterplot (2 Quantitative variables)

hdb = read.csv("hdb_2017_now.csv")
head(hdb)

#flat_type: 1 room, 2 room, 3 room, 4 room, 5 room, Executive and Multi-Generation

boxplot(hdb$resale_price~hdb$flat_type, col = c(2,3,4,5,6,7,8))

hist(hdb$resale_price[which(hdb$flat_type == "3 ROOM")], col = 2,
     main = " ", xlab = "Price of 3 ROOM flat")

##### SCATTER PLOT #####

# 1st argument is x axis, 2nd argument is y axis


plot(hdb$floor_area_sqm, hdb$resale_price, col = 2)

# Syntactic sugar but instead, it is y axis ~ x axis


# Get used to this one as it is used a lot in later chapters
# Read as y is "some combination" of x
plot(hdb$resale_price ~ hdb$floor_area_sqm, col = 5)

# For correlation, if either variable has no variance, r is undefined (cor() returns NA with a warning)
# E.g. if 2 points have the same y-value and different x-values (or vice versa), that
# variable has no variance
# But if there are only 2 points that differ in both x and y, r is 1 or -1
# If the scatter is symmetric (e.g. a U shape), correlation = 0
# Usually remove outliers before this
cor(hdb$floor_area_sqm, hdb$resale_price) # cor(x, y) == cor(y, x)

# Sometimes, we change the axes to ignore the outliers to prevent a "squashed l


# But if the y-axis does not start from 0, we will not be able to find out
# the y-intercept from the graph solely
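
The special cases of correlation mentioned above can be checked on small made-up vectors (all data here is an assumption for illustration).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

r <- cor(x, y)                         # strong positive linear association
r_swapped <- cor(y, x)                 # order of arguments does not matter

two_points <- cor(x[1:2], y[1:2])      # only 2 distinct points: r is 1 or -1

# A constant variable has zero variance, so r is undefined (NA plus a warning)
flat <- suppressWarnings(cor(x, rep(3, 5)))

# Symmetric U shape: clear relationship, but no LINEAR association
u_shape <- cor(c(-2, -1, 0, 1, 2), c(4, 1, 0, 1, 4))

print(c(r = r, two_points = two_points, flat = flat, u_shape = u_shape))
```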

Statistics
Sampling

# With replacement
sales_counts %>%
sample_n(2, replace = TRUE) # Allow duplicates

# Without replacement
sales_counts %>%
sample_n(2) # Don't allow duplicates

Discrete Distribution

# Create probability distribution


size_distribution <- restaurant_groups %>%
count(group_size) %>%

mutate(probability = n / sum(n))

# Calculate probability of picking group of 4 or more


size_distribution %>%
# Filter for groups of 4 or larger
filter(group_size >= 4) %>%
# Calculate prob_4_or_more by taking sum of probabilities
summarize(prob_4_or_more = sum(probability))
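
The same calculation can be sketched in base R without dplyr. The `group_size` values below are made up; the expected value line is an extra illustration of summing value times probability.

```r
group_size <- c(2, 4, 6, 2, 2, 2, 3, 2, 4, 2)   # made-up group sizes

counts <- table(group_size)
probability <- counts / sum(counts)             # probability of each size
sizes <- as.numeric(names(probability))

prob_4_or_more <- sum(probability[sizes >= 4])  # P(group size >= 4)
expected_size <- sum(sizes * probability)       # sum of value * probability

print(probability)
print(c(prob_4_or_more = prob_4_or_more, expected_size = expected_size))
```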

Continuous Distribution
The continuous uniform distribution is the probability distribution where all values in a given range [a, b] are equally
likely

# To calculate P(wait time <= 7)


punif(7, min = 0, max = 12)

# To calculate P(wait time >= 7)


punif(7, min = 0, max = 12, lower.tail = FALSE)

# To calculate P(4 <= wait time <= 7)


punif(7, min = 0, max = 12) - punif(4, min = 0, max = 12)

# Create 1000 random numbers in interval
runif(1000, 0, 30)
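
For a uniform distribution on [0, 12] the probabilities are just lengths divided by 12, which gives an easy check of `punif()`. The simulation at the end is a sketch to show that `runif()` agrees, with a seed chosen arbitrarily.

```r
p_le_7 <- punif(7, min = 0, max = 12)                       # 7/12
p_gt_7 <- punif(7, min = 0, max = 12, lower.tail = FALSE)   # 5/12
p_4_to_7 <- punif(7, min = 0, max = 12) - punif(4, min = 0, max = 12)  # 3/12

# Simulated check: share of random waits that are at most 7
set.seed(42)
sim <- runif(100000, min = 0, max = 12)
sim_share <- mean(sim <= 7)

print(c(p_le_7 = p_le_7, p_gt_7 = p_gt_7, p_4_to_7 = p_4_to_7, sim_share = sim_share))
```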

Binomial Distribution
X ~ Bin(n, p) where n is the number of trials and p is probability of success

#### Simulating Random Outcomes


# Use rbinom(n, size, prob) to simulate random outcomes from a binomial distrib
# n -> number of random values to generate (observations)
# size -> number of trials (how many times an event is repeated)
# prob -> probability of success in each trial (between 0 and 1)

# Simulate 10 coin flips with 1 trial per flip, probability of heads = 0.5
# If there are x coin flips per trial, then size = x
random_outcomes = rbinom(10, size = 1, prob = 0.5)

#### Binomial Probability Mass Function


# Use dbinom(k, size, prob) to calculate the prob of exactly k successes in n trials
# k -> number of successes
# size -> number of trials
# prob -> probability of success in each trial

# Probability of getting exactly 3 heads in 5 coin flips with probability of heads = 0.5


binom_prob = dbinom(3, size = 5, prob = 0.5)

#### Cumulative Probability


# Use pbinom(k, size, prob) to calculate the cumulative probability P(X ≤ k)
# Can specify lower.tail = FALSE to calculate P(X > k)

# Cumulative probability of getting 3 or fewer heads in 5 coin flips (P(X <= 3))
cumulative_prob = pbinom(3, size = 5, prob = 0.5)

# Cumulative probability of getting more than 3 heads in 5 coin flips (P(X > 3))
cumulative_prob_more_than_3 = pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)

#### Expected Value


# The expected value for a binomial distribution is E(x) = np
# Where n -> number of trials, p -> probability of success

# Expected number of heads in 5 coin flips with a 0.5 probability of heads


expected_value = 5 * 0.5
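
A quick consistency check of the binomial functions for 5 fair coin flips. It shows that `pbinom()` is the running sum of `dbinom()` and that E(X) = np also falls out of the probability mass function.

```r
n <- 5
p <- 0.5

pmf <- dbinom(0:n, size = n, prob = p)      # P(X = 0), ..., P(X = 5)

p_exactly_3 <- dbinom(3, size = n, prob = p)                      # 10/32
p_at_most_3 <- pbinom(3, size = n, prob = p)                      # 26/32
p_more_than_3 <- pbinom(3, size = n, prob = p, lower.tail = FALSE) # 6/32

expected_from_pmf <- sum(0:n * pmf)         # sum of k * P(X = k)

print(c(p_exactly_3, p_at_most_3, p_more_than_3, expected_from_pmf))
```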

Normal Distribution
X ~ N(μ, σ²)

##### Finding Values #####


# Use pnorm(value_to_test, mean, sd) to find percent of people that 'fail' value_to_test
# (mean and sd below stand for the numeric mean and standard deviation of the distribution)
# Calculate percent of women shorter than 154
pnorm(154, mean, sd) # P(X < 154)

# Calculate percent of women taller than 154
pnorm(154, mean, sd, lower.tail = FALSE) # P(X > 154)

# Calculate percent of women between 154 and 157


pnorm(157, mean, sd) - pnorm(154, mean, sd) # P(154 < X < 157)

# Use qnorm(quantile, mean, sd) to find the exact variable value (e.g., height) tha
# people ‘fail’
# E.g., height of women where 90% are shorter than
qnorm(0.9, mean, sd)

# E.g., height of women where 90% are taller than


qnorm(0.9, mean, sd, lower.tail = FALSE)
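
A runnable sketch of the `pnorm()` / `qnorm()` pair. The mean of 161 and standard deviation of 7 are assumed values for illustration only, they do not come from the notes above.

```r
mu <- 161      # assumed mean height (hypothetical)
sigma <- 7     # assumed standard deviation (hypothetical)

p_shorter <- pnorm(154, mean = mu, sd = sigma)                    # P(X < 154)
p_taller <- pnorm(154, mean = mu, sd = sigma, lower.tail = FALSE) # P(X > 154)

# qnorm is the inverse of pnorm: height below which 90% of values fall
h90 <- qnorm(0.9, mean = mu, sd = sigma)
back_to_prob <- pnorm(h90, mean = mu, sd = sigma)

# Height that 90% are taller than = the 10th percentile
h_upper <- qnorm(0.9, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(p_shorter = p_shorter, p_taller = p_taller, h90 = h90, h_upper = h_upper))
```

Since 154 is exactly one standard deviation below the assumed mean, `p_shorter` equals `pnorm(-1)`.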

##### Simulating Outcomes #####

# Use rnorm(number_of_observations_to_generate, mean, sd) to simulate random
# Simulate 36 sales with a specific mean and standard deviation
new_sales <- new_sales %>%
mutate(amount = rnorm(36, new_mean, new_sd))

### Generate random samples from a normal distribution


# N -> Number of samples (rows)
# C -> Sample size (columns)
# mu -> Mean of the normal distribution
# sd -> Standard deviation of the normal distribution
N_samples = matrix(rnorm(N * C, mu, sd), nrow = N, ncol = C)

# Calculate the mean of each sample (row)


[Link] = rowMeans(N_samples) # [Link] contains the sample means for each of th

##### Central Limit Theorem (CLT) #####


# To replicate an experiment, use replicate(amount_to_replicate, function_to_repl
# Simulate 10 experiments of taking the mean of 5 dice rolls
sample_means = replicate(10, sample(die, 5, replace = T) %>% mean())

# This is known as a sampling distribution

## Sampling Distribution
# This will approach closer to a normal distribution as the number of trials increa
# / approaches infinity
# This is known as the CLT (Central Limit Theorem)
# Only applies if samples are random and independent
# Large populations
# Don’t know what the actual distribution is

## Mean of Sampling Distribution


# mean(sample_means) gives you the expected value of the sample means x_ba
mean(sample_means)

# mean(sample_proportions) gives you the expected proportion of the mean P_h
mean(sample_proportions)
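
A simulation sketch of the sampling distribution of the mean for a fair die. The sample size of 30, the 5000 replications and the seed are arbitrary choices for illustration. The mean of the sample means should sit near the population mean 3.5, and their spread near the population sd divided by sqrt(30).

```r
set.seed(123)
die <- 1:6

# 5000 experiments, each the mean of 30 rolls
sample_means <- replicate(5000, mean(sample(die, 30, replace = TRUE)))

population_sd <- sqrt(mean((die - mean(die))^2))   # sd of a single roll
theoretical_se <- population_sd / sqrt(30)         # sd of the sample mean

print(c(mean_of_means = mean(sample_means),
        sd_of_means = sd(sample_means),
        theoretical_se = theoretical_se))

hist(sample_means, breaks = 30, main = "Sampling distribution of the mean")
```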

Sampling Distribution of Proportion

Since we don't know anything about p i.e. the population proportion, we replace it with p̂ i.e. the sample proportion

Sampling Distribution of Mean

But in practice / when using sample data,

t-Distribution

# 1. Plotting the t-distribution with df = 10
x <- seq(-4, 4, length = 100) # Sequence of t-values
y <- dt(x, df = 10) # t-distribution PDF with df = 10

# Plot the t-distribution curve


plot(x, y, type = "l", main = "t-Distribution (df = 10)",
xlab = "t-value", ylab = "Density", col = "blue")

# 2. Finding the cumulative probability for a t-value (e.g., P(t < 2) for df = 10)
probability <- pt(2, df = 10)
cat("Probability that t < 2 (df = 10):", probability, "\n")

# 3. Finding the critical t-value for a given probability (e.g., 95th percentile)
# This takes in the quantile: 1 - a for a one-sided test, 1 - a / 2 for a two-sided test or CI
critical_t_value <- qt(0.95, df = 10)
cat("Critical t-value at the 95th percentile (df = 10):", critical_t_value, "\n")

# 4. Generating 100 random t-distributed values with df = 10


random_t_values <- rt(100, df = 10)

# Plotting the histogram of random t-values


hist(random_t_values, breaks = 20, freq = FALSE, main = "Random t-distributed Values (df = 10)",
     xlab = "t-values", col = "lightblue", border = "black") # freq = FALSE plots densities so the curve in step 5 is on the same scale

# 5. Overlaying the theoretical t-distribution curve on the histogram


curve(dt(x, df = 10), col = "red", lwd = 2, add = TRUE)
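
One common use of `qt()` is building a confidence interval for a mean by hand. The sketch below uses a small made-up sample and compares the manual interval with the one reported by `t.test()`.

```r
x <- c(5.1, 6.3, 5.8, 5.6, 6.2)   # made-up sample
n <- length(x)
alpha <- 0.05

# Two-sided 95% CI uses the 1 - alpha/2 quantile with n - 1 degrees of freedom
t_crit <- qt(1 - alpha / 2, df = n - 1)
standard_error <- sd(x) / sqrt(n)
manual_ci <- mean(x) + c(-1, 1) * t_crit * standard_error

builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
```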

Hypothesis Testing
One-Sample Hypothesis Test (Done in R code)

Two-Sample Hypothesis Test

Testing Equal Variances

var.test(sample1, sample2) # F test: H0 is that the two variances are equal
# If p-value < 0.05, treat the variances as unequal, else as equal
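
The two steps (test the variances, then pick the matching t-test) can be chained. This is a sketch using the same example groups as the t-test sections; the 0.05 cutoff follows the rule of thumb above.

```r
group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Step 1: F test for equal variances
variance_test <- var.test(group1, group2)
equal_var <- variance_test$p.value >= 0.05   # TRUE -> treat variances as equal

# Step 2: pooled t-test if equal, Welch t-test otherwise
result <- t.test(group1, group2, alternative = "two.sided", var.equal = equal_var)

print(variance_test$p.value)
print(result)
```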

Independent T Test with Equal Variances

# Independent t-test with equal variances


# This is used when you assume equal variances between the two groups

# Example data for two independent samples


group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Perform the t-test assuming equal variances


t.test(group1, group2, alternative = "two.sided", var.equal = TRUE)

# This returns the test statistic, p-value, and confidence interval

Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).

Alternative Hypothesis (H1): The means are not equal (μ1≠μ2).

When to Use: This test is used when the variances of the two groups are
assumed to be the same.

Independent T Test with Unequal Variances

# Independent t-test with unequal variances (Welch's t-test)
# This is used when you assume unequal variances between the two groups

# Example data for two independent samples


group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Perform the t-test assuming unequal variances


t.test(group1, group2, alternative = "two.sided", var.equal = FALSE)

# This returns the test statistic, p-value, degrees of freedom, and confidence interval

Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).

Alternative Hypothesis (H1): The means are not equal (μ1≠μ2).

When to Use: This test is used when the variances of the two groups are
unequal (i.e., when the assumption of equal variances is violated).

Dependent T Test

# Paired t-test for dependent samples


# This is used when you have two related samples
# (e.g., pre-treatment and post-treatment data)

# Example data for paired samples
pre_treatment <- c(55, 60, 65, 70, 75)
post_treatment <- c(58, 63, 66, 72, 78)

# Perform the paired t-test


t.test(pre_treatment, post_treatment, alternative = "two.sided", paired = TRUE)

# Alternatively, run a one-sample test on the differences
diff = pre_treatment - post_treatment
t.test(diff, mu = 0, alternative = "two.sided")

# This returns the test statistic, p-value, and confidence interval

Null Hypothesis (H0): The mean difference between the paired observations
is zero (μD=0).

Alternative Hypothesis (H1): The mean difference is not zero (μD≠0).

When to Use: This test is used when the data is paired, meaning the two
samples are not independent but are related (e.g., measurements before and
after treatment).
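
A short check that the paired t-test is the same thing as a one-sample t-test on the differences, using the example data from this section.

```r
pre_treatment <- c(55, 60, 65, 70, 75)
post_treatment <- c(58, 63, 66, 72, 78)

paired_result <- t.test(pre_treatment, post_treatment,
                        alternative = "two.sided", paired = TRUE)

differences <- pre_treatment - post_treatment
one_sample_result <- t.test(differences, mu = 0, alternative = "two.sided")

print(c(paired_t = unname(paired_result$statistic),
        one_sample_t = unname(one_sample_result$statistic)))
print(c(paired_p = paired_result$p.value,
        one_sample_p = one_sample_result$p.value))
```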

Linear Regression
Simple Linear Regression
Only 1 regressor

Assumptions

Data obtained by randomisation

y ~ x is linear

Check scatterplot and look at bands

Values of y stable as x increases?

ϵ ~ N(0, σ²) ⇒ This is the error term

This essentially means constant variance assumption is met

We also implicitly assume that this error term is normal

Computed using error terms:

M1 = lm(y ~ x, data = data)


summary(M1)

# The col name here must match EXACTLY in the original data i.e. same as exp variable
vals_to_predict = data.frame(x = c(20, 40))
predict(M1, vals_to_predict)

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

# Plot the data points


plot(x, y, main = "Linear Regression", xlab = "x", ylab = "y", pch = 19)

### THIS IS A BIT OF EXTRA KNOWLEDGE BUT WOULD BE GOOD FOR EXAMS
abline(M1, col = "blue", lwd = 2)

# Interval estimates
confint(M1, level = 0.95) # To get confint for the intercept and all regressors
confint(M1, 'col_name', level = 0.95) # To get confint for that regressor only
### It is known that the intercept is the mean response when x = 0

t-tests for β1

Tests if a regressor or term is significant

F-tests for β1

Tests if WHOLE model is significant

Null hypothesis: All coefficients, except intercept, are 0 i.e. model is insignificant

Alt hypothesis: At least one of the coefficients, except intercept, is non 0

In Simple linear regression model, hypothesis for F-tests == hypothesis for t-tests

This is because there is only one regressor coefficient β1

df1 = # coefficients

df2 = n - (# coefficients + 1 i.e. from the intercept)
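
In simple linear regression the F statistic is the square of the slope's t statistic, and the degrees of freedom follow the rule above (df1 = 1, df2 = n - 2). A sketch on the small sample data used in this section:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

fit <- lm(y ~ x)
fit_summary <- summary(fit)

t_slope <- fit_summary$coefficients["x", "t value"]   # t-test for the slope
f_stat <- unname(fit_summary$fstatistic["value"])     # F-test for the whole model
df1 <- unname(fit_summary$fstatistic["numdf"])        # number of regressors
df2 <- unname(fit_summary$fstatistic["dendf"])        # n - (regressors + 1)

print(c(t_squared = t_slope^2, f_stat = f_stat, df1 = df1, df2 = df2))
```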

Regression Diagnostics
Randomization: From the steps of data collection

Linearity: can check this assumption using a scatter plot between response Y
and regressor X and the residuals plot

Normality: is checked using the residuals of the built model

Constant variance: is checked using the residuals of built model

Linearity

In (2): Linearity is violated

Possible fix is to add higher order terms e.g. x²

In R, need to wrap around in I ⇒ lm(y ~ x + I(x^2), data = data)

In (3): Variance is not constant

Possible fix is to transform response var y

In R, lm(log(y) ~ x) or lm(sqrt(y) ~ x) or lm(1/y ~ x)

No need I term, log in R is automatically log e == ln

Only need these 3 transformations

Transformations will change the coefficient β1
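
A sketch of why the I() wrapper matters. Inside a formula, x^2 is read as a formula operator and collapses back to x, so no squared term is fitted without I(). The simulated curved data below is an assumption for illustration.

```r
set.seed(7)
x <- seq(1, 10, length.out = 50)
y <- 2 + 0.5 * x^2 + rnorm(50, sd = 1)   # simulated curved relationship

linear_fit <- lm(y ~ x)
no_wrapper_fit <- lm(y ~ x + x^2)        # x^2 is dropped: same model as y ~ x
quadratic_fit <- lm(y ~ x + I(x^2))      # I() keeps the arithmetic meaning

print(names(coef(no_wrapper_fit)))
print(names(coef(quadratic_fit)))
print(c(linear_r2 = summary(linear_fit)$r.squared,
        quadratic_r2 = summary(quadratic_fit)$r.squared))
```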

Residual Plots

Used to check normality, constant variance, whether need to transform y or add higher order term of x

Most commonly used are Standardised Residuals or SR

# How to get SR
car = read.csv("C:/Data/[Link]")
attach(car)
M1 = lm(Selling_Price~Present_Price, data = car)
[Link] = M1$res # These are the raw residuals
SR = rstandard(M1) # These are the standardised residuals

Necessary Plots

# SRs (y-axis) against y_i (on x-axis)


# Expect points to scatter randomly about 0 but be between -3 and 3
y.i = fitted(M1)
plot(SR ~ y.i)

# SR against X (the regressor, here Present_Price)
# Expect points to scatter randomly about 0 but be between -3 and 3
plot(SR ~ Present_Price)

# SR histogram
# Expect to be normally distributed
hist(SR)

# QQ plot of SR
qqnorm(SR)
qqline(SR)

Cooks Distance
Outliers are identified when SR > 3 or SR < -3

We need to see whether these points should be dropped or corrected

Influential Points
This is an outlier that affects the model parameters estimates greatly

Outliers may or may not be influential

If the outlier's Cook's distance is > 1, then it is considered influential

which(SR>3|SR<(-3)) # index of outliers


65 79 83 86 87 95
65 79 83 86 87 95

# If you want the outliers directly
outliers_index = which(SR>3|SR<(-3))
data[outliers_index, ] # rows of the data that are outliers

C = cooks.distance(M1)
which(C>1) # index of influential point
87
87
# Hence, may try to drop the 87th point and fit the model again
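
A self-contained sketch of the same workflow on made-up data, where the last point sits far from the rest in x and far below the trend in y. The data and object names are assumptions for illustration.

```r
# Ten points close to y = 2x, plus one unusual point at x = 20
x <- c(1:10, 20)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2, 5)

fit <- lm(y ~ x)
SR <- rstandard(fit)            # standardised residuals
C <- cooks.distance(fit)        # Cook's distance for every point

outlier_index <- which(abs(SR) > 3)     # same as SR > 3 | SR < -3
influential_index <- which(C > 1)       # influential points

print(round(C, 2))
print(influential_index)

# Refit without the influential point and compare the slopes
refit <- lm(y[-influential_index] ~ x[-influential_index])
print(c(slope_all = unname(coef(fit)[2]), slope_refit = unname(coef(refit)[2])))
```

Dropping the flagged point moves the slope back towards 2, which is what "influential" means in practice.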

Coefficient of Determination R²
Takes on value between 0 and 1

If there are repeated x values with different y values, R² can never be 1

If it is equal to 1 ⇒ y_i = ŷ_i ∀i

Multiple R sq for simple linear regression

Adj R sq for multiple linear regression
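
A sketch of computing both values by hand and checking them against `summary()`, using the small sample data from the regression section. The formulas used are R² = 1 - RSS/TSS and adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), with k the number of regressors.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)        # unexplained variation
tss <- sum((y - mean(y))^2)         # total variation in y
r_squared <- 1 - rss / tss

n <- length(y)
k <- 1                              # number of regressors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
print(c(summary(fit)$r.squared, summary(fit)$adj.r.squared))
```

With a single regressor, R² also equals the squared correlation `cor(x, y)^2`.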

