Main R Cheatsheet
Basic Syntax
Arithmetic Operations: +, -, /, *, ^, %%
Print: print(x)
Data Types
Decimal values like 4.5 are called numerics
Whole numbers written with an L suffix, like 4L, are integers (a plain 4 is stored as a numeric). Integers are also numerics
Structures
Vector
# Repeat
rep(1:2, times=3)
# Sequence
seq(1, 10, by=2)
# Name elements
names(vec) <- c(...)
# Access -> 1-indexed NOT 0-indexed
vec[1]
# Multiple access
vec[c(1,3)] # using index
named_vec[c("A", "B")] # using names
# Vector of 0s of length = n
numeric(n)
# Element-wise operations
vec + 5
vec * 5
# Excluding elements
v[-1] # Exclude 1st element
v[-c(2, 4)] # Exclude 2nd and 4th elements
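A small worked example of the vector operations above. The values and the names "A" to "E" are made up purely for illustration:

```r
vec <- seq(1, 10, by = 2)                 # 1 3 5 7 9
names(vec) <- c("A", "B", "C", "D", "E")  # name each element
vec[1]                                    # first element (R is 1-indexed)
vec[c("A", "C")]                          # access by name
kept <- vec[-c(2, 4)]                     # drop 2nd and 4th elements
vec * 5                                   # element-wise multiplication
zeros <- numeric(3)                       # 0 0 0
repeated <- rep(1:2, times = 3)           # 1 2 1 2 1 2
```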
Matrices
Mainly → matrix(vector_values, nrow = m, ncol = n)
# byrow = T
matrix(1:9, byrow = TRUE, nrow = 3)
> [,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
# byrow = F
matrix(1:9, byrow = FALSE, nrow = 3)
> [,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
# Sums
rowSums(x) # Sums across rows
colSums(x) # Sums across cols
# Means
rowMeans(x)
colMeans(x)
# Naming
rownames(x) = c(...)
colnames(x) = c(...)
# Indexing (1 - Indexed)
x[c(a,b), c(c,d)] # Gets rows a & b of cols c & d
x[, col_selection] # All rows and specific cols
x[row_selection, ] # All cols and specific rows
# where row/col_selection can be names i.e. c("sel1", "sel2")
# Element-wise operations
x+5
x*5
# Excluding elements
x[-1, ] # Include everything except 1st row
x[, -c(2, 4)] # Include everything except 2nd and 4th cols
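A minimal sketch tying the matrix operations together on the 3 x 3 example from above. The row and column names ("r1", "c1", ...) are hypothetical:

```r
x <- matrix(1:9, byrow = TRUE, nrow = 3)
rownames(x) <- c("r1", "r2", "r3")
colnames(x) <- c("c1", "c2", "c3")
row_totals <- rowSums(x)        # r1 = 6, r2 = 15, r3 = 24
col_means  <- colMeans(x)       # c1 = 4, c2 = 5, c3 = 6
x[c(1, 2), c(2, 3)]             # rows 1-2 of columns 2-3
x["r3", ]                       # whole third row, selected by name
without_first <- x[-1, ]        # everything except the first row
```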
Factors
Used for nominal categorical variables (NO natural ordering) and ordinal
categorical variables (HAS a natural ordering)
# Nominal Categorical
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
# Ordinal Categorical
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
levels = c("Low", "Medium", "High"))
factor_temperature_vector
Levels: Low < Medium < High
# Creating Factors
factor(x) # x can be a vector, a matrix, or a data frame column (dataframe$col)
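As an illustration of why the ordering matters, here is a short sketch using the temperature example. The variable name `f` is only for this example:

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
f <- factor(temperature_vector, ordered = TRUE,
            levels = c("Low", "Medium", "High"))
levels(f)                 # "Low" "Medium" "High"
is_greater <- f[1] > f[2] # comparisons work on ordered factors: High > Low
codes <- as.integer(f)    # underlying level codes: 3 1 3 1 2
counts <- table(f)        # counts reported in level order
```

Comparisons such as `f[1] > f[2]` are only meaningful for ordered factors; on an unordered factor R returns NA with a warning.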
DataFrame
Object with rows and columns
# Viewing df
head(df), tail(df), str(df), names(df)
# Renaming df
names(df) = c("new_name1", "new_name2")
names(df)[names(df) == "col"] = "name" # Rename particular column / can use c
# Filter / Logical df
logical_df = df == "some_value" # Element-wise comparison
# Mainly useful if want to calculate proport
# col against a denominator e.g. tut 6
desired_df = df[df$col == "some_value", ] # Keep rows where col equals the value, with all columns
df %>%
filter(col == "some_value") # dplyr framework to do the same above
# Indexing df
df[1, 2] # Row 1, Column 2 (75)
df[ , 1] # All rows, column 1
df[1:2, ] # Rows 1 and 2, all columns
# Excluding elements
df[-1, ] # Exclude 1st row
df[ , -2] # Exclude 2nd column
df[-c(1,3), -1] # Exclude rows 1 & 3, and column 1
# Sums
rowSums(df) # Sums across rows
colSums(df) # Sums across cols
# Means
rowMeans(df)
colMeans(df)
EDA
Frequency Table & Barplot
# Frequency Table
wanted_freq = table(df$wanted_col)
# Proportion Table
prop.table(wanted_freq)
# Percentage Table
prop.table(wanted_freq) * 100
# Bar Graph
barplot(frequency_table, main = "Bar Plot Title",
xlab = "Responses", ylab = "Frequency", col = "some_colour")
### Some important params are xlim = c(x1, x2), ylim = c(y1, y2)
Summarise using:
Modal category
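A minimal sketch of the frequency-table workflow, including one way to pull out the modal category. The survey responses are made up, and `modal_category` is a name introduced only for this example:

```r
# Hypothetical survey responses, made up for illustration
df <- data.frame(wanted_col = c("Yes", "No", "Yes", "Yes", "Maybe", "No"))
wanted_freq <- table(df$wanted_col)        # counts per category
wanted_prop <- prop.table(wanted_freq)     # proportions, sum to 1
wanted_pct  <- wanted_prop * 100           # percentages, sum to 100
# Modal category = the category with the highest count
modal_category <- names(wanted_freq)[which.max(wanted_freq)]
barplot(wanted_freq, main = "Bar Plot Title", xlab = "Responses",
        ylab = "Frequency", col = "steelblue", ylim = c(0, 4))
```

Note that `which.max()` returns only the first maximum, so ties between categories would need separate handling.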
# Read the CSV file (assuming it's named '[Link]' and has a column named 'BMI')
data <- read.csv("[Link]")
# Alternatively,*****
[Link](data$bmi)
# 4. Comment on symmetry
# mean - median gives only the direction of skew; it is not a skewness coefficient
skewness <- mean(data$BMI, na.rm = TRUE) - median(data$BMI, na.rm = TRUE)
if (skewness > 0) {
print("The BMI distribution is right-skewed (positively skewed).")
} else if (skewness < 0) {
print("The BMI distribution is left-skewed (negatively skewed).")
} else {
print("The BMI distribution is symmetric.")
}
# Range
range(data) # Will include outliers
# IQR
IQR(data)
# Outliers
outliers = df$col[df$col < lower_bound | df$col > upper_bound] # Or,
# Remove outliers
new_vec_no_outliers = df$col[-outliers_index] # col without outliers
new_df_no_outliers = df[-outliers_index, ] # dataframe without outliers
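The snippets above use `lower_bound`, `upper_bound` and `outliers_index` without defining them. One common way to obtain them is the 1.5 x IQR rule, which is an assumption here since the notes do not say which rule was intended. A sketch on made-up data:

```r
# Made-up data frame; `col` has one obviously extreme value
df <- data.frame(col = c(10, 12, 11, 13, 12, 11, 50))
q1 <- quantile(df$col, 0.25)
q3 <- quantile(df$col, 0.75)
lower_bound <- q1 - 1.5 * IQR(df$col)   # assumed 1.5 * IQR rule
upper_bound <- q3 + 1.5 * IQR(df$col)
outliers_index <- which(df$col < lower_bound | df$col > upper_bound)
outliers <- df$col[outliers_index]
# Guard: negative indexing with an empty index would drop every row
new_df_no_outliers <- if (length(outliers_index) > 0) {
  df[-outliers_index, , drop = FALSE]
} else {
  df
}
```

The guard matters because `df[-integer(0), ]` selects zero rows rather than all of them.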
Overall Pattern:
Identify any potential outliers (observations that deviate from the rest)
Presenting Boxplot
# Load data from CSV (assuming column named 'BMI')
data <- read.csv("[Link]")
# 5. Identify the Min and Max of the data that are **not outliers**
non_outliers <- data$BMI[data$BMI >= lower_bound & data$BMI <= upper_bound]
cat("Min and Max of non-outlier data: \n")
cat("Min:", min(non_outliers), "\n")
cat("Max:", max(non_outliers), "\n")
Left Skewed (Negative Skew): Left tail is longer; data pulls to the left
Right Skewed (Positive Skew): Right tail is longer; data pulls to the right
Median & IQR are more robust to outliers than Mean & Variance
bc = read.csv("breast_cancer.csv")
# bbd: cancer status, 0 = no; 1 = yes
# pmh: post-menopausal hormone usage: 2 = no; 3 = yes
# alcohol: amount of alcohol consumed
# hgt: height of participant
# agemenop: age of participant at menopause
# bmi: body mass index of participant
#or
# Taking cancer as the denominator***
# create conditional percentage (condition on cancer)
proptab = prop.table(tab, "cancer")*100
proptab # percentages within EACH COLUMN (cancer) sum to 100%
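A sketch of the conditional-percentage table on made-up data. The data frame `bc_demo` and its values are hypothetical; only the column names echo the variables described above:

```r
# Made-up data; pmh: 2 = no, 3 = yes; cancer: 0 = no, 1 = yes
bc_demo <- data.frame(
  pmh    = c(2, 3, 3, 2, 3, 2, 3, 3),
  cancer = c(0, 0, 1, 1, 1, 0, 1, 0)
)
# Naming the dimensions puts pmh in rows and cancer in columns
tab <- table(pmh = bc_demo$pmh, cancer = bc_demo$cancer)
# margin = 2 conditions on the column variable (cancer)
proptab <- prop.table(tab, margin = 2) * 100
column_totals <- colSums(proptab)   # each column sums to 100
```

Using `margin = 1` instead would condition on the row variable, so each row would sum to 100.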
####### If cancer is in column then:
#bc$agemenop~cancer => This means you put "agemenop" as the quantitative variable
#and "cancer" as the categorical variable, i.e., y ~ x
boxplot(bc$agemenop~cancer, col = c(5,5), ylab = "Age at Menopause",
xlab = "Cancer Status")
# Alternatively,
boxplot(agemenop ~ cancer, data = data, col = c(5, 5), ylab = "Age at Menopause",
xlab = "Cancer Status")
boxplot(bc$agemenop~cancer)$names
hdb = read.csv("hdb_2017_now.csv")
head(hdb)
cor(hdb$floor_area_sqm, hdb$resale_price) # cor(x, y) == cor(y, x)
Statistics
Sampling
# With replacement
sales_counts %>%
sample_n(2, replace = TRUE) # Allow duplicates
# Without replacement
sales_counts %>%
sample_n(2) # Don't allow duplicates
Discrete Distribution
mutate(probability = n / sum(n))
Continuous Distribution
Continuous uniform distribution: a probability distribution where all values in
a given range [a, b] are equally likely
# Create 1000 random numbers in interval
runif(1000, 0, 30)
Binomial Distribution
X ~ Bin(n, p) where n is the number of trials and p is probability of success
# Simulate 10 coin flips with 1 trial per flip, probability of heads = 0.5
# If there are x coin flips per trial, then size = x
random_outcomes = rbinom(10, size = 1, prob = 0.5)
# Cumulative probability of getting 3 or fewer heads in 5 coin flips (P(x <= 3))
cumulative_prob = pbinom(3, size = 5, prob = 0.5)
# Cumulative probability of getting more than 3 heads in 5 coin flips (P(X > 3))
cumulative_prob_more_than_3 = pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)
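The two cumulative probabilities can be cross-checked against the probability mass function `dbinom()`. This is only a sanity-check sketch; the variable names are illustrative:

```r
p_le_3 <- pbinom(3, size = 5, prob = 0.5)                      # P(X <= 3)
p_gt_3 <- pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)  # P(X > 3)
# Same quantities by summing individual probabilities
p_le_3_check <- sum(dbinom(0:3, size = 5, prob = 0.5))  # 26/32 = 0.8125
p_gt_3_check <- sum(dbinom(4:5, size = 5, prob = 0.5))  # 6/32  = 0.1875
p_le_3 + p_gt_3                                         # the two tails sum to 1
```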
Normal Distribution
X ~ N(μ, σ²)
# Use qnorm(quantile, mean, sd) to find the exact variable value (e.g., height)
# below which that proportion of the distribution falls
# E.g., height of women where 90% are shorter than
qnorm(0.9, mean, sd)
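`qnorm()` and `pnorm()` are inverses of each other, which gives a quick way to check a quantile. The mean and standard deviation below are hypothetical numbers, not taken from any dataset in these notes:

```r
mu <- 160      # hypothetical mean height (cm)
sigma <- 7     # hypothetical standard deviation (cm)
h90 <- qnorm(0.9, mean = mu, sd = sigma)           # height that 90% are shorter than
p_back <- pnorm(h90, mean = mu, sd = sigma)        # recovers 0.9
p_taller <- pnorm(h90, mean = mu, sd = sigma,
                  lower.tail = FALSE)              # remaining 10% are taller
```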
# Use rnorm(number_of_observations_to_generate, mean, sd) to simulate random values from a normal distribution
# Simulate 36 sales with a specific mean and standard deviation
new_sales <- new_sales %>%
mutate(amount = rnorm(36, new_mean, new_sd))
## Sampling Distribution
# The sampling distribution approaches a normal distribution as the number of trials
# increases / approaches infinity
# This is known as the CLT (Central Limit Theorem)
# Only applies if samples are random and independent
# Large populations
# Don’t know what the actual distribution is
# mean(sample_proportions) estimates the expected value of the sample proportion P_hat
mean(sample_proportions)
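A small simulation sketch of a sampling distribution of proportions. The true proportion `p`, the sample size `n` and the number of replications are all assumptions chosen for illustration:

```r
set.seed(1)              # for reproducibility
p <- 0.3                 # hypothetical true proportion
n <- 50                  # hypothetical sample size
# Draw 2000 samples of size n and record each sample proportion
sample_proportions <- replicate(2000, mean(rbinom(n, size = 1, prob = p)))
mean(sample_proportions)           # close to p
sd(sample_proportions)             # close to sqrt(p * (1 - p) / n)
theoretical_se <- sqrt(p * (1 - p) / n)
hist(sample_proportions)           # roughly bell-shaped, as the CLT suggests
```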
t-Distribution
# 1. Plotting the t-distribution with df = 10
x <- seq(-4, 4, length = 100) # Sequence of t-values
y <- dt(x, df = 10) # t-distribution PDF with df = 10
# 2. Finding the cumulative probability for a t-value (e.g., P(t < 2) for df = 10)
probability <- pt(2, df = 10)
cat("Probability that t < 2 (df = 10):", probability, "\n")
# 3. Finding the critical t-value for a given probability (e.g., the 95th percentile)
# qt() takes the quantile; for a two-sided interval at level 1 - a, pass 1 - a / 2 (0.975 for 95%)
critical_t_value <- qt(0.95, df = 10)
cat("95th percentile of the t-distribution (df = 10):", critical_t_value, "\n")
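To make the difference between a percentile and a two-sided critical value concrete, here is a short comparison for df = 10. The name `alpha` is introduced only for this sketch:

```r
alpha <- 0.05
t_one_sided <- qt(1 - alpha, df = 10)       # 95th percentile, about 1.812
t_two_sided <- qt(1 - alpha / 2, df = 10)   # 97.5th percentile, about 2.228
# Central area between -t and +t for the two-sided value is 1 - alpha
central_area <- pt(t_two_sided, df = 10) - pt(-t_two_sided, df = 10)
```

A 95% confidence interval uses `t_two_sided`, while a one-tailed test at the 5% level uses `t_one_sided`.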
Hypothesis Testing
One-Sample Hypothesis Test (Done in R code)
Testing Equal Variances
var.test(sample1, sample2)
# If p-value < 0.05, treat the variances as unequal; otherwise treat them as equal
Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).
When to Use: This test is used when the variances of the two groups are
assumed to be the same.
# Independent t-test with unequal variances (Welch's t-test)
# This is used when you assume unequal variances between the two groups
# This returns the test statistic, p-value, degrees of freedom, and confidence interval
Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).
When to Use: This test is used when the variances of the two groups are
unequal (i.e., when the assumption of equal variances is violated).
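A sketch of how the two independent-samples tests might be run with base R's `t.test()`, where `var.equal` switches between the pooled and Welch versions. The samples are simulated and purely illustrative:

```r
set.seed(42)
sample1 <- rnorm(30, mean = 50, sd = 5)   # made-up group 1
sample2 <- rnorm(30, mean = 53, sd = 9)   # made-up group 2
f_test_p <- var.test(sample1, sample2)$p.value         # equal-variance check
pooled <- t.test(sample1, sample2, var.equal = TRUE)   # equal variances assumed
welch  <- t.test(sample1, sample2, var.equal = FALSE)  # Welch (the default)
pooled$parameter   # df = n1 + n2 - 2 = 58
welch$parameter    # Welch df, never larger than 58
welch$conf.int     # confidence interval for mu1 - mu2
```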
Dependent T Test
# Example data for paired samples
pre_treatment <- c(55, 60, 65, 70, 75)
post_treatment <- c(58, 63, 66, 72, 78)
# Alternatively,
t.test(diff, mu = 0, alternative = "two.sided") # diff = vector of paired differences
Null Hypothesis (H0): The mean difference between the paired observations
is zero (μD=0).
When to Use: This test is used when the data is paired, meaning the two
samples are not independent but are related (e.g., measurements before and
after treatment).
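The paired test and the one-sample test on the differences are equivalent. A sketch using the example data above, where the name `d` and the direction of the difference (post minus pre) are choices made for this illustration:

```r
pre_treatment  <- c(55, 60, 65, 70, 75)
post_treatment <- c(58, 63, 66, 72, 78)
# Paired t-test on the two related samples
paired <- t.test(post_treatment, pre_treatment, paired = TRUE)
# Equivalent one-sample t-test on the differences
d <- post_treatment - pre_treatment            # 3 3 1 2 3
one_sample <- t.test(d, mu = 0, alternative = "two.sided")
paired$p.value       # identical to one_sample$p.value
```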
Linear Regression
Simple Linear Regression
Only 1 regressor
Assumptions
Data obtained by randomisation
y ~ x is linear
# The column name here must match EXACTLY the regressor name in the original data used to fit the model
vals_to_predict = data.frame(x = c(20, 40))
predict(M1, vals_to_predict)
# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
### THIS IS A BIT OF EXTRA KNOWLEDGE BUT WOULD BE GOOD FOR EXAMS
abline(M1, col = "blue", lwd = 2)
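Putting the pieces together, a minimal end-to-end sketch on the sample data above: fit, plot, predict and get an interval estimate. The model name `M1` follows the notes; everything else is illustrative:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
M1 <- lm(y ~ x)                        # simple linear regression, one regressor
coef(M1)                               # intercept 1.3, slope 0.9
plot(x, y)                             # scatter plot to check linearity
abline(M1, col = "blue", lwd = 2)      # overlay the fitted line
# Column name must match the regressor name used in the formula
vals_to_predict <- data.frame(x = c(20, 40))
preds <- predict(M1, vals_to_predict)  # 19.3 and 37.3
slope_ci <- confint(M1, "x", level = 0.95)
```

Predicting at x = 20 or 40 extrapolates far beyond the observed range of 1 to 5, so such predictions should be read with caution.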
# Interval estimates
confint(M1, level = 0.95) # To get confint for all intercepts and regressors
confint(M1, 'col_name', level = 0.95) # To get confint for that regressor only
### The intercept is the mean response when x = 0
t-tests for β1
F-tests for β1
Alt hypothesis: At least one of the coefficients, excluding the intercept, is nonzero
df1 = number of coefficients, excluding the intercept
Regression Diagnostics
Randomization: From the steps of data collection
Linearity: can check this assumption using a scatter plot between response Y
and regressor X and the residuals plot
Linearity
In (2): Linearity is violated
Residual Plots
Most commonly used are Standardised Residuals or SR
# How to get SR
car = read.csv("C:/Data/[Link]")
attach(car)
M1 = lm(Selling_Price~Present_Price, data = car)
[Link] = M1$res # These are the raw residuals
SR = rstandard(M1) # These are the standard residuals
Necessary Plots
# SR against X
# Expect points to scatter randomly about 0 but be between -3 and 3
plot(SR ~ x)
# SR histogram
# Expect to be normally distributed
hist(SR)
# QQ plot of SR
qqnorm(SR)
qqline(SR)
Cooks Distance
Outliers are identified when SR > 3 or SR < -3
Influential Points
An influential point is an outlier that greatly affects the model's parameter estimates
C = cooks.distance(M1)
which(C>1) # index of influential point
87
87
# Hence, may try to drop the 87th point and fit the model again
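A sketch of the drop-and-refit step on simulated data. One deliberately extreme point is planted at index 20; the data, the names `M` and `M_refit`, and the planted values are all made up:

```r
set.seed(7)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.5)   # true slope 0.5
x[20] <- 60                              # plant a far-off, high-leverage point
y[20] <- 0
M <- lm(y ~ x)
C <- cooks.distance(M)
SR <- rstandard(M)
influential <- which(C > 1)              # index of influential point(s)
which(abs(SR) > 3)                       # outliers by the SR rule
# Refit without the influential point(s) and compare coefficients
keep <- setdiff(seq_along(x), influential)
M_refit <- lm(y[keep] ~ x[keep])
coef(M)
coef(M_refit)
```

Comparing the two sets of coefficients shows how much a single point can pull the fitted line.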
Coefficient of Determination R²
Takes on value between 0 and 1
If it is equal to 1 ⇒ y_i = ŷ_i ∀i
Multiple R sq for simple linear
regression
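R² can also be computed by hand as 1 - RSS/TSS and compared with the "Multiple R-squared" that `summary()` reports. A sketch on a small made-up dataset:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)
M1 <- lm(y ~ x)
rss <- sum(residuals(M1)^2)            # residual sum of squares = 1.9
tss <- sum((y - mean(y))^2)            # total sum of squares = 10
r2_manual  <- 1 - rss / tss            # 0.81
r2_summary <- summary(M1)$r.squared    # same value from summary()
r2_cor     <- cor(x, y)^2              # equals R² in simple linear regression
```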