Main R Cheatsheet

This R Cheatsheet provides a comprehensive overview of basic syntax, data types, structures, and operations in R, including vectors, matrices, factors, and data frames. It also covers exploratory data analysis (EDA) techniques such as frequency tables, histograms, and boxplots, along with methods for identifying outliers and skewness in data. The document serves as a quick reference guide for R programming and data analysis.

Uploaded by

coool coool

R Cheatsheet

Basic Syntax
Arithmetic Operations: +, -, /, *, ^, %%

Assignments: x <- 5 , x=5

Comments: # This is a comment

Print: print(x)

Data Types
Decimal values like 4.5 are called numerics

Whole numbers like 4L are called integers (a bare 4 is stored as a numeric). Integers are also numerics

Boolean values ( TRUE or FALSE ) are called logical

Text (or string) values ( "Hello" ) are called characters
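The types above can be checked with class(). This is a minimal sketch; the variable names are illustrative only and not from the original notes.

```r
# Check the type of a few literal values
a <- 4.5      # numeric (double)
b <- 4L       # integer: the L suffix forces integer storage
c_val <- 4    # a bare 4 is stored as numeric, not integer
d <- TRUE     # logical
e <- "Hello"  # character

class(a); class(b); class(c_val); class(d); class(e)
is.numeric(b) # TRUE: integers also count as numeric
```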

Structures
Vector

# Creating a vector + shortcut

vec = c(1, 2, 3) # holds the same values as the shortcut 1:3

# Repeat
rep(1:2, times=3)

# Sequence
seq(1, 10, by=2)

# Name elements
names(vec) <- c(...)

# Access -> 1-indexed NOT 0-indexed
vec[1]

# Multiple access
vec[c(1,3)] # using index
named_vec[c("A", "B")] # using names

# Getting desired elements

# filter => vec[logical vector] to select wanted elements
# This does element-wise comparison i.e. all elements compared
vec[vec > 5]

# Vector of 0s of length = n
numeric(n)

# Element-wise operations
vec + 5
vec * 5

# Some vec functions

sum(v); mean(v); max(x); min(x); range(x)
cor(x, y) # Correlation between x and y
sort(x) # Can be used for finding the median via sort(x)[middle_index]

# Excluding elements
v[-1] # Exclude 1st element
v[-c(2, 4)] # Exclude 2nd and 4th elements

Matrices
Mainly → matrix(vector_values, nrow = m, ncol = n)

Creates mxn matrix

# byrow = T
matrix(1:9, byrow = TRUE, nrow = 3)

> [,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9

# byrow = F
matrix(1:9, byrow = FALSE, nrow = 3)
> [,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

# Using dimnames to name matrix rows and columns

matrix(x, nrow = 3, byrow = TRUE, dimnames = list(titles, region))

# rbind i.e. row bind => attach a new row

rbind(x, y)

# cbind i.e. col bind => attach a new col

cbind(x, y)

# Sums
rowSums(x) # Sums across rows
colSums(x) # Sums across cols

# Means
rowMeans(x)
colMeans(x)

# Naming
rownames(x) = c(...)
colnames(x) = c(...)

# Indexing (1 - Indexed)
x[c(a,b), c(c,d)] # Gets rows a & b of cols c & d
x[, col_selection] # All rows and specific cols

x[row_selection, ] # All cols and specific rows
# where row/col_selection can be names i.e. c("sel1", "sel2")

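# Transpose, element-wise reciprocal, matrix inverse, matrix multiplication

t(x)
1/x # Reciprocal of each element, NOT the matrix inverse
solve(x) # Matrix inverse (x must be square and invertible)
x %*% y # Matrix multiplication (%% is the modulo operator)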

# Element-wise operations
x+5
x*5

# Excluding elements
x[-1, ] # Include everything except 1st row
x[, -c(2, 4)] # Include everything except 2nd and 4th cols
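The difference between element-wise and matrix operations is easy to mix up, so here is a small sketch. The matrix A and its values are made up for illustration.

```r
# A 2 x 2 matrix filled row by row
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)

A_inv <- solve(A)   # matrix inverse
A %*% A_inv         # matrix product: the 2 x 2 identity (up to rounding)
A * A_inv           # element-wise product: generally NOT the identity
t(A)                # transpose (A is symmetric here, so t(A) equals A)
rowSums(A)          # 3 4
```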

Factors
Used for nominal categorical variable (NO natural ordering) and an ordinal
categorical variable (HAS a natural ordering)

I mainly use it for renaming cols in a dataframe (covered in dataframes)

# Nominal Categorical
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector

> [1] Elephant Giraffe Donkey Horse

Levels: Donkey Elephant Giraffe Horse

# Ordinal Categorical
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE,
levels = c("Low", "Medium", "High"))
factor_temperature_vector

> [1] High Low High Low Medium

Levels: Low < Medium < High

# FOR ORDINAL CATEGORICAL

some_factor[2] # Selects the 2nd factor value
some_factor[3] # Selects the 3rd factor value
some_factor[2] > some_factor[3] # Checks if one value of an ORDERED factor ranks above another

# "Renaming" Factors => This will be important for dataframes

levels(factor_vector) <- c("name1", "name2",...)

# Creating Factors
factor(x) # x can be a vector, a matrix, or a dataframe$col
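Putting the factor operations together, the following is a minimal sketch with made-up temperature labels. It shows that comparisons follow the level order and that renaming levels works position by position.

```r
temperature <- c("High", "Low", "High", "Low", "Medium")
f <- factor(temperature, ordered = TRUE, levels = c("Low", "Medium", "High"))

f[1] > f[2]      # TRUE: High ranks above Low
table(f)         # counts per level, shown in level order

# Renaming replaces labels position by position (Low, Medium, High)
levels(f) <- c("L", "M", "H")
as.character(f)  # "H" "L" "H" "L" "M"
```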

DataFrame
Object with rows and columns

Each row is a subject

Each col is a variable

# Creating Dataframes connected col by col

df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Accessing column and rows

df$col
df[, "col"]
df[1, 2]

# Generating subset based on a columns value

subset(df, a > 1) # Returns rows satisfying the column constraint

# Viewing df
head(df), tail(df), str(df), names(df)

# Renaming df

names(df) = c("new_name1", "new_name2")
names(df)[names(df) == "col"] = "name" # Rename particular column / can use c

# Renaming column values

df$col = factor(df$col, levels = c(0, 1), labels = c("no", "yes"))
df$col = ifelse(df$col == 0, "no", "yes")

# Filter / Logical df
logical_df = df == "some_value" # Element-wise comparison
# Mainly useful if want to calculate proport
# col against a denominator e.g. tut 6
desired_df = df[df$col == "some_value", ] # Get all columns for rows with the specified value

subset(df, col == "some_value") # Can apply any logical operator

df %>%
filter(col == "some_value") # dplyr framework to do the same above

# Adding new column

df$new_col = ifelse(df$other_col == "some_value", a, b) # Adding new col based
# logical test o
# can be prim
df = df %>%
mutate(new_col = case_when(
other_col == value_1 ~ a,
other_col == value_2 ~ b
))

# Indexing df
df[1, 2] # Row 1, Column 2 (75)
df[ , 1] # All rows, column 1
df[1:2, ] # Rows 1 and 2, all columns

# Excluding elements
df[-1, ] # Exclude 1st row
df[ , -2] # Exclude 2nd column

df[-c(1,3), -1] # Exclude rows 1 & 3, and column 1

# Attach (NOT RECOMMENDED)

attach(df) # Allows col names to be called as is without $ operator
df[col == "some_value", ] # Trailing comma selects rows

# Sums
rowSums(df) # Sums across rows
colSums(df) # Sums across cols

# Means
rowMeans(df)
colMeans(df)
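The data frame operations above can be tried end to end on a toy table. This sketch uses base R only; the column names (id, smoker, score, grade) and values are hypothetical.

```r
df <- data.frame(id = 1:4,
                 smoker = c(0, 1, 0, 1),
                 score = c(75, 62, 88, 70))

# Recode a 0/1 column into labels
df$smoker <- factor(df$smoker, levels = c(0, 1), labels = c("no", "yes"))

# Add a column from a logical test
df$grade <- ifelse(df$score >= 70, "pass", "fail")

# Keep rows that satisfy a condition (note the trailing comma)
high <- df[df$score > 70, ]
nrow(high)   # 2

# Row/column sums and means only work on numeric columns
colMeans(df[, c("id", "score")])
```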

EDA
Frequency Table & Barplot

# Frequency Table
wanted_freq = table(df$wanted_col)

# Proportion Table
prop.table(wanted_freq)

# Percentage Table
prop.table(wanted_freq) * 100

# Bar Graph
barplot(wanted_freq, main = "Bar Plot Title",
xlab = "Responses", ylab = "Frequency", col = "some_colour")
### Some important params are xlim = c(x1, x2), ylim = c(y1, y2)

Done on categorical variables

Summarise using:

Modal category

Proportion or percentage for modal category
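As a sketch of how the modal category and its percentage could be pulled from a frequency table (the responses vector is made up for illustration):

```r
responses <- c("yes", "no", "yes", "maybe", "yes", "no")

freq <- table(responses)               # counts per category
pct  <- prop.table(freq) * 100         # percentages

modal <- names(freq)[which.max(freq)]  # modal category
modal                                  # "yes"
pct[modal]                             # 50 (3 of 6 responses)
```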

Histogram & Boxplots

Presenting Histogram

# Read the CSV file (Assuming it's named '[Link]' and has a column named 'BMI')
data <- read.csv("[Link]")

# 1. Get numerical summaries

summary(data$BMI) # Provides min, 1st quartile, median, mean, 3rd quartile, max
mean(data$BMI, na.rm = TRUE) # Mean
median(data$BMI, na.rm = TRUE) # Median
sd(data$BMI, na.rm = TRUE) # Standard deviation

# 2. Form a histogram by frequency

hist(data$BMI, main = "Histogram of BMI (Frequency)", xlab = "BMI", col = "lightblue")

# Form a histogram by probability (normalized)

hist(data$BMI, probability = TRUE, main = "Histogram of BMI (Probability)",
xlab = "BMI", col = "lightgreen")

# 3. Form a boxplot and check for outliers

boxplot(data$BMI, main = "Boxplot of BMI", ylab = "BMI", col = "lightcoral")

# Identify outliers manually

Q1 <- quantile(data$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data$BMI, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Alternatively, boxplot.stats() lists the outliers in its $out component
boxplot.stats(data$BMI)

outliers <- data$BMI[data$BMI < lower_bound | data$BMI > upper_bound]

print("Outliers:")
print(outliers)

# 4. Comment on symmetry
skewness <- mean(data$BMI, na.rm = TRUE) - median(data$BMI, na.rm = TRUE)
if (skewness > 0) {
print("The BMI distribution is right-skewed (positively skewed).")
} else if (skewness < 0) {
print("The BMI distribution is left-skewed (negatively skewed).")
} else {
print("The BMI distribution is symmetric.")
}

##### EXTRA #####

# data is some quantitative variable
hist(data, breaks = 20, prob = TRUE, main = "Some Title") # Use prob = FALSE for frequencies

# Specifying breaks width / interval width

break_seq = seq(0, 200, by = 5) # 0 5 10 ... 195 200
hist(data, breaks = break_seq, prob = TRUE, main = "Some Title") # Use prob = FALSE for frequencies

# Specifying specific data, mark is some quantitative variable vector

# Or is called after attach(data)
hist(mark[mark > 30], breaks = 20, prob = TRUE, main = "Some Title") # Use prob = FALSE for frequencies

# Range
range(data) # Will include outliers

# Standard Deviation and Variance

sd(data)
var(data)

# IQR

IQR(data)

# Outliers
outliers = df$col[df$col < lower_bound | df$col > upper_bound] # Or,

outliers = data %>%

filter(value < lower_bound | value > upper_bound) %>%
pull(value) # Or,

outliers = boxplot.stats(data)$out #### BEST AND EASIEST ####

outliers_index = which(data$col %in% c(outliers))

# Remove outliers
new_vec_no_outliers = df$col[-outliers_index] # col without outliers
new_df_no_outliers = df[-outliers_index, ] # dataframe without outliers
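The two outlier approaches above can be compared on a small vector. This is a sketch with made-up BMI values; note that boxplot.stats() uses hinges rather than quantile(), so the two methods can disagree slightly on other data. The guard on idx is an addition here, since negative indexing with an empty index would drop every element.

```r
bmi <- c(21, 22, 23, 23, 24, 25, 26, 27, 45)

# Manual 1.5 * IQR rule
q1 <- quantile(bmi, 0.25)
q3 <- quantile(bmi, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
out_manual <- bmi[bmi < lower | bmi > upper]

# Same idea via boxplot.stats()
out_stats <- boxplot.stats(bmi)$out

# Remove outliers, guarding against the "no outliers" case
idx <- which(bmi %in% out_stats)
bmi_clean <- if (length(idx) > 0) bmi[-idx] else bmi
```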

How to summarise histograms?

Overall Pattern:

Look for clusters or gaps

Identify any potential outliers (observations that deviate from the rest)

Unimodal vs. Multimodal:

Unimodal: Single peak or mound

Bell-Shape ⇒ Unimodal & Symmetric

Bimodal/Multimodal: Two or more peaks (modes)

Symmetry vs. Skewness:

Symmetric Distribution: Both sides mirror each other

Skewed Distribution: One tail is longer than the other

Presenting Boxplot

# Load data from CSV (assuming column named 'BMI')
data <- read.csv("[Link]")

# 1. Get Numerical Summaries (5-number summary + mean, sd)

summary(data$BMI) # Provides min, 1st quartile, median, mean, 3rd quartile, max
mean(data$BMI, na.rm = TRUE) # Mean of BMI
median(data$BMI, na.rm = TRUE) # Median of BMI
sd(data$BMI, na.rm = TRUE) # Standard deviation of BMI

# 2. Create a histogram for visualization

hist(data$BMI, main = "Histogram of BMI (Frequency)", xlab = "BMI", col = "lightblue")

# 3. Create a Boxplot to visualize the distribution of BMI

boxplot(data$BMI, main = "Boxplot of BMI", ylab = "BMI", col = "lightcoral")

# 4. Identify Outliers using IQR method

Q1 <- quantile(data$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data$BMI, 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1 # Interquartile range

# Calculate lower and upper bounds for the whiskers

lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Alternative: Identify outliers using the boxplot.stats function

outliers <- boxplot.stats(data$BMI)$out

# Print identified outliers

cat("Outliers identified: \n")
print(outliers)

# 5. Identify the Min and Max of the data that are **not outliers**
non_outliers <- data$BMI[data$BMI >= lower_bound & data$BMI <= upper_bound]
cat("Min and Max of non-outlier data: \n")
cat("Min:", min(non_outliers), "\n")
cat("Max:", max(non_outliers), "\n")

# 6. Comment on Symmetry (Skewness)

skewness <- mean(data$BMI, na.rm = TRUE) - median(data$BMI, na.rm = TRUE)
if (skewness > 0) {
cat("The BMI distribution is right-skewed (positively skewed).\n")
} else if (skewness < 0) {
cat("The BMI distribution is left-skewed (negatively skewed).\n")
} else {
cat("The BMI distribution is symmetric.\n")
}

Outliers and Skewness

Skewness: A distribution is skewed when one tail is longer than the other

Left Skewed (Negative Skew): Left tail is longer; data pulls to the left

Mean < Median

Right Skewed (Positive Skew): Right tail is longer; data pulls to the right

Mean > Median

Outliers: Skewness can indicate the presence of outliers, typically in the longer tail

High Skew ? ⇒ Report Median & IQR

Low Skew / ~ Symmetric ? ⇒ Report Mean & Var / Spread

Median & IQR are more robust to outliers than Mean & Variance

Association Between 2 Variables

Contingency Table (2 Categorical Variables)

bc = read.csv("breast_cancer.csv")
# bbd: cancer status, 0 = no; 1 = yes
# pmh: post-menopausal hormone usage: 2 = no; 3 = yes

# alcohol: amount of alcohol consumed
# hgt: height of participant
# agemenop: age of participant at menopause
# bmi: body mass index of participant

##### TWO CATEGORICAL VARIABLES #####

# Creating Contingency Table

table(bc$bbd, bc$pmh) # Usually exp variable then resp variable

#create new variables for bbd and pmh:

cancer <- ifelse(bc$bbd=="0","Absent","Present") # saved as vector

pmh.use <- ifelse(bc$pmh == "2", "No", "Yes") # saved as vector

# replace the original values of bbd by new labels:

# bbd <- ifelse(bbd=="0","Absent","Present")

tab = table(cancer, pmh.use) # pmh.use in column

tab
proptab = prop.table(tab)*100 # joint probabilities
proptab

tab = table(pmh.use, cancer) # cancer in column

proptab = prop.table(tab)*100 # joint probabilities

proptab

### CONDITIONAL PROBABILITIES:

# Taking pmh.use as the denominator***

# create conditional percentage (condition on pmh.use)
proptab = prop.table(tab, "pmh.use")*100
proptab # total probabilities in EACH ROW (pmh use) is 100%

#or

# Taking cancer as the denominator***
# create conditional percentage (condition on cancer)
proptab = prop.table(tab, "cancer")*100
proptab # total probabilities in EACH COLUMN (cancer) is 100%

##### CLUSTERED BAR PLOT (has 'beside = TRUE') #####

# it plots the clusters = columns
# we need the conditional probability on pmh.use
tab = table(cancer, pmh.use) # pmh.use in column
tab
proptab = prop.table(tab, "pmh.use")*100
proptab

# 2 clusters are formed by the column's categories, this is by default.

barplot(proptab, beside = TRUE)

barplot(proptab, beside = TRUE, xlab = "PMH Usage", main="",
col=c("darkblue","red"),legend = rownames(proptab), ylim = c(0, 70) )

tab = table(pmh.use, cancer) # cancer in column

proptab = prop.table(tab, "pmh.use")*100
barplot(proptab, beside = TRUE, xlab = "Cancer", main="",
col=c("darkblue","red"),legend = rownames(proptab), ylim = c(0, 70) )

##### STACKED BAR PLOT #####

tab = table(cancer, pmh.use) # pmh.use in column

proptab = prop.table(tab, "pmh.use")*100 ; proptab

barplot(proptab, xlab = "PMH Usage", main="",col=c("darkblue","red"),
legend = rownames(proptab))

# each bar is 100% for each group of PMH.

####### If cancer is in column then:

tab = table(pmh.use, cancer)

tab

proptab = prop.table(tab, "pmh.use")*100

proptab

barplot(proptab, xlab = "Cancer", main="",col=c(2,5),legend = rownames(proptab))

#not recommended to interpret percentages when response var is in column

Boxplot (1 Categorical Variable & 1 Quantitative Variable)

#bc$agemenop~cancer => This means you put "agemenop" as the quantitative variable
#and "cancer" as the categorical variable i.e, y ~ x
boxplot(bc$agemenop~cancer, col = c(5,5), ylab = "Age at Menopause",
xlab = "Cancer Status")

# Alternatively,
boxplot(agemenop ~ cancer, data = bc, col = c(5, 5), ylab = "Age at Menopause",
xlab = "Cancer Status")

boxplot(bc$agemenop~cancer)$out # values of all outliers for both boxplots

# identify the group of the outliers (1 = Absent or 2 = Present)

grp = boxplot(bc$agemenop~cancer)$group ; grp

# Tells you which of the outliers belong to values == 1

which(grp ==1) # index of outlier grp == 1 (Absent)
boxplot(bc$agemenop~cancer)$out[which(grp ==1)] # values of outliers in grp == 1

# Tells you which of the outliers belong to values == 2

which(grp ==2) # index of outlier in grp == 2 (Present)
boxplot(bc$agemenop~cancer)$out[which(grp ==2)] # values of outliers in grp == 2

boxplot(bc$agemenop~cancer)$names

### Boxplot is most suitable to compare 1 quantitative and 1 categorical variable

### To do this with histogram, we need to overlap them which is very difficult to

Scatterplot (2 Quantitative variables)

hdb = read.csv("hdb_2017_now.csv")
head(hdb)

#flat_type: 1 room, 2 room, 3 room, 4 room, 5 room, Executive and Multi-Generation

boxplot(hdb$resale_price~hdb$flat_type, col = c(2,3,4,5,6,7,8))

hist(hdb$resale_price[which(hdb$flat_type == "3 ROOM")], col = 2,

main = " ", xlab = "Price of 3 ROOM flat")

##### SCATTER PLOT #####

# 1st argument is x axis, 2nd argument is y axis

plot(hdb$floor_area_sqm, hdb$resale_price, col = 2)

# Syntactic sugar: the formula form is y axis ~ x axis

# Get used to this one as it is used a lot in later chapters
# Read as y is "some combination" of x
plot(hdb$resale_price ~ hdb$floor_area_sqm, col = 5)

# For correlation, if x or y has no variance, the correlation is undefined (cor() returns NA)

# E.g. if all points share the same y-value but have different x-values (or vice versa),
# there is no variance in that variable
# But if there are only 2 points that differ in both x and y, r is 1 or -1
# If the pattern is symmetric (e.g. a U-shape), correlation can be 0 despite a clear relationship
# Usually remove outliers before this

cor(hdb$floor_area_sqm, hdb$resale_price) # cor(x, y) == cor(y, x)

# Sometimes, we change the axes to ignore the outliers to prevent a "squashed l

# But if the y-axis does not start from 0, we will not be able to find out
# the y-intercept from the graph solely

Statistics
Sampling

# With replacement
sales_counts %>%
sample_n(2, replace = TRUE) # Allow duplicates

# Without replacement
sales_counts %>%
sample_n(2) # Don't allow duplicates

Discrete Distribution

# Create probability distribution

size_distribution <- restaurant_groups %>%
count(group_size) %>%

mutate(probability = n / sum(n))

# Calculate probability of picking group of 4 or more

size_distribution %>%
# Filter for groups of 4 or larger
filter(group_size >= 4) %>%
# Calculate prob_4_or_more by taking sum of probabilities
summarize(prob_4_or_more = sum(probability))

Continuous Distribution
Continuous uniform distribution: all values in a given range [a, b] are equally
likely

# To calculate P(wait time <= 7)

punif(7, min = 0, max = 12)

# To calculate P(wait time >= 7)

punif(7, min = 0, max = 12, lower.tail = FALSE)

# To calculate P(4 <= wait time <= 7)

punif(7, min = 0, max = 12) - punif(4, min = 0, max = 12)

# Create 1000 random numbers in interval
runif(1000, 0, 30)

Binomial Distribution
X ~ Bin(n, p) where n is the number of trials and p is probability of success

#### Simulating Random Outcomes

# Use rbinom(n, size, prob) to simulate random outcomes from a binomial distribution
# n -> number of random values to generate (observations)
# size -> number of trials (how many times an event is repeated)
# prob -> probability of success in each trial (between 0 and 1)

# Simulate 10 coin flips with 1 trial per flip, probability of heads = 0.5
# If there are x coin flips per trial, then size = x
random_outcomes = rbinom(10, size = 1, prob = 0.5)

#### Binomial Probability Mass Function

# Use dbinom(k, size, prob) to calculate the prob of exactly k successes in n trials
# k -> number of successes
# size -> number of trials
# prob -> probability of success in each trial

# Probability of getting exactly 3 heads in 5 coin flips with probability of heads = 0.5

binom_prob = dbinom(3, size = 5, prob = 0.5)

#### Cumulative Probability

# Use pbinom(k, size, prob) to calculate the cumulative probability P(X ≤ k)
# Can specify lower.tail = FALSE to calculate P(X > k)

# Cumulative probability of getting 3 or fewer heads in 5 coin flips (P(x <= 3))
cumulative_prob = pbinom(3, size = 5, prob = 0.5)

# Cumulative probability of getting more than 3 heads in 5 coin flips (P(X > 3))
cumulative_prob_more_than_3 = pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)

#### Expected Value

# The expected value for a binomial distribution is E(x) = np
# Where n -> number of trials, p -> probability of success

# Expected number of heads in 5 coin flips with a 0.5 probability of heads

expected_value = 5 * 0.5
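The binomial functions above fit together as follows. This is a short sketch for the 5-flip fair coin used in these notes; the variable names are illustrative. It also shows the off-by-one needed for P(X >= k), since lower.tail = FALSE gives the strict P(X > k).

```r
n <- 5; p <- 0.5

exactly_3   <- dbinom(3, size = n, prob = p)                      # P(X = 3)  = 0.3125
at_most_3   <- pbinom(3, size = n, prob = p)                      # P(X <= 3) = 0.8125
more_than_3 <- pbinom(3, size = n, prob = p, lower.tail = FALSE)  # P(X > 3)  = 0.1875

# P(X >= 3): shift k down by one because lower.tail = FALSE is strict
at_least_3  <- pbinom(2, size = n, prob = p, lower.tail = FALSE)  # 0.5

# Expected value from the definition, matching n * p = 2.5
expected <- sum(0:n * dbinom(0:n, size = n, prob = p))
```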

Normal Distribution
X ~ N(μ, σ²)

##### Finding Values #####

# Use pnorm(value_to_test, mean, sd) to find percent of people that ‘fail’ value_to_test
# Calculate percent of women shorter than 154
pnorm(154, mean, sd) # P(X < 154)

# Calculate percent of women taller than 154

pnorm(154, mean, sd, lower.tail = FALSE) # P(X > 154)

# Calculate percent of women between 154 and 157

pnorm(157, mean, sd) - pnorm(154, mean, sd) # P(154 < X < 157)

# Use qnorm(quantile, mean, sd) to find the exact variable value (e.g., height) that a given
# proportion of people ‘fail’
# E.g., height of women where 90% are shorter than
qnorm(0.9, mean, sd)

# E.g., height of women where 90% are taller than

qnorm(0.9, mean, sd, lower.tail = FALSE)
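To make the pnorm/qnorm calls concrete, here is a sketch with assumed parameters. The mean of 160 and standard deviation of 5 are hypothetical values chosen for illustration, not figures from the notes.

```r
mu <- 160; sigma <- 5   # hypothetical height distribution (cm)

below_154 <- pnorm(154, mean = mu, sd = sigma)                      # P(X < 154)
above_154 <- pnorm(154, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X > 154)
between   <- pnorm(157, mu, sigma) - pnorm(154, mu, sigma)          # P(154 < X < 157)

# qnorm is the inverse of pnorm: 90% of values fall below h90
h90 <- qnorm(0.9, mean = mu, sd = sigma)
pnorm(h90, mu, sigma)   # 0.9

# The upper-tail quantile mirrors the lower-tail one
h10 <- qnorm(0.9, mu, sigma, lower.tail = FALSE)  # same as qnorm(0.1, mu, sigma)
```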

##### Simulating Outcomes #####

# Use rnorm(number_of_observations_to_generate, mean, sd) to simulate random
# Simulate 36 sales with a specific mean and standard deviation
new_sales <- new_sales %>%
mutate(amount = rnorm(36, new_mean, new_sd))

### Generate random samples from a normal distribution

# N -> Number of samples (rows)
# C -> Sample size (columns)
# mu -> Mean of the normal distribution
# sd -> Standard deviation of the normal distribution
N_samples = matrix(rnorm(N * C, mu, sd), nrow = N, ncol = C)

# Calculate the mean of each sample (row)

[Link] = rowMeans(N_samples) # [Link] contains the sample means for each of th

##### Central Limit Theorem (CLT) #####

# To replicate an experiment, use replicate(amount_to_replicate, function_to_repl
# Simulate 10 experiments of taking the mean of 5 dice rolls
sample_means = replicate(10, sample(die, 5, replace = T) %>% mean())

# This is known as a sampling distribution

## Sampling Distribution
# This will approach closer to a normal distribution as the number of trials increases
# / approaches infinity
# This is known as the CLT (Central Limit Theorem)
# Only applies if samples are random and independent
# Large populations
# Don’t know what the actual distribution is

## Mean of Sampling Distribution

# mean(sample_means) gives you the expected value of the sample means x_bar
mean(sample_means)

# mean(sample_proportions) gives you the expected proportion of the mean P_hat
mean(sample_proportions)
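The sampling-distribution idea can be simulated in base R. This is a sketch only: the seed, the 10000 replications, and the sample size of 5 are arbitrary choices made here, and mean() is called directly instead of through the pipe.

```r
set.seed(1)             # arbitrary seed so the run is repeatable
die <- 1:6

# 10000 experiments, each taking the mean of 5 dice rolls
sample_means <- replicate(10000, mean(sample(die, 5, replace = TRUE)))

mean(sample_means)      # close to the population mean 3.5
sd(sample_means)        # close to sqrt(35/12) / sqrt(5), the standard error

# The histogram should look roughly bell-shaped
hist(sample_means, breaks = 20, main = "Sampling distribution of the mean")
```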

Sampling Distribution of Proportion

Since we don’t know anything about p, i.e. the population proportion, we replace it with p̂, i.e. the sample proportion

Sampling Distribution of Mean

But in practice / when using sample data,

t-Distribution

# 1. Plotting the t-distribution with df = 10
x <- seq(-4, 4, length = 100) # Sequence of t-values
y <- dt(x, df = 10) # t-distribution PDF with df = 10

# Plot the t-distribution curve

plot(x, y, type = "l", main = "t-Distribution (df = 10)",
xlab = "t-value", ylab = "Density", col = "blue")

# 2. Finding the cumulative probability for a t-value (e.g., P(t < 2) for df = 10)
probability <- pt(2, df = 10)
cat("Probability that t < 2 (df = 10):", probability, "\n")

# 3. Finding the critical t-value for a given probability (e.g., 95th percentile)
# This takes in the quantile, e.g. 1 - a / 2 for a two-sided interval at level a
critical_t_value <- qt(0.95, df = 10)
cat("95th percentile t-value (df = 10):", critical_t_value, "\n")

# 4. Generating 100 random t-distributed values with df = 10

random_t_values <- rt(100, df = 10)

# Plotting the histogram of random t-values

hist(random_t_values, breaks = 20, probability = TRUE, main = "Random t-distributed Values (df = 10)",
xlab = "t-values", col = "lightblue", border = "black")

# 5. Overlaying the theoretical t-distribution curve on the histogram

curve(dt(x, df = 10), col = "red", lwd = 2, add = TRUE)

Hypothesis Testing
One-Sample Hypothesis Test (Done in R code)

Two-Sample Hypothesis Test

Testing Equal Variances

var.test(sample1, sample2)
# If p-value < 0.05, reject equal variances (treat as unequal); otherwise treat as equal

Independent T Test with Equal Variances

# Independent t-test with equal variances

# This is used when you assume equal variances between the two groups

# Example data for two independent samples

group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Perform the t-test assuming equal variances

t.test(group1, group2, alternative = "two.sided", var.equal = TRUE)

# This returns the test statistic, p-value, and confidence interval

Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).

Alternative Hypothesis (H1): The means are not equal (μ1≠μ2).

When to Use: This test is used when the variances of the two groups are
assumed to be the same.
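The variance check and the t-test can be chained, as sketched below with the example data from these notes. The 0.05 cut-off and the variable names vt, equal_var and res are assumptions made here for illustration.

```r
group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Step 1: F-test for equal variances
vt <- var.test(group1, group2)
equal_var <- vt$p.value >= 0.05   # TRUE means equal variances are not rejected

# Step 2: two-sample t-test, pooled or Welch depending on step 1
res <- t.test(group1, group2, alternative = "two.sided", var.equal = equal_var)

res$statistic   # test statistic
res$p.value     # p-value
res$conf.int    # confidence interval for mu1 - mu2
```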

Independent T Test with Unequal Variances

# Independent t-test with unequal variances (Welch's t-test)
# This is used when you assume unequal variances between the two groups

# Example data for two independent samples

group1 <- c(5.1, 6.3, 5.8, 5.6, 6.2)
group2 <- c(7.1, 7.5, 7.3, 6.9, 7.2)

# Perform the t-test assuming unequal variances

t.test(group1, group2, alternative = "two.sided", var.equal = FALSE)

# This returns the test statistic, p-value, degrees of freedom, and confidence interval

Null Hypothesis (H0): The means of both groups are equal (μ1=μ2).

Alternative Hypothesis (H1): The means are not equal (μ1≠μ2).

When to Use: This test is used when the variances of the two groups are
unequal (i.e., when the assumption of equal variances is violated).

Dependent T Test

# Paired t-test for dependent samples

# This is used when you have two related samples
# (e.g., pre-treatment and post-treatment data)

# Example data for paired samples
pre_treatment <- c(55, 60, 65, 70, 75)
post_treatment <- c(58, 63, 66, 72, 78)

# Perform the paired t-test

t.test(pre_treatment, post_treatment, alternative = "two.sided", paired = TRUE)

# Alternatively, run a one-sample test on the differences
diff = pre_treatment - post_treatment
t.test(diff, mu = 0, alternative = "two.sided")

# This returns the test statistic, p-value, and confidence interval

Null Hypothesis (H0): The mean difference between the paired observations
is zero (μD=0).

Alternative Hypothesis (H1): The mean difference is not zero (μD≠0).

When to Use: This test is used when the data is paired, meaning the two
samples are not independent but are related (e.g., measurements before and
after treatment).
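A quick sketch confirming that the paired test and the one-sample test on the differences agree, using the example data from these notes (the shortened variable names are chosen here):

```r
pre  <- c(55, 60, 65, 70, 75)
post <- c(58, 63, 66, 72, 78)

paired <- t.test(pre, post, alternative = "two.sided", paired = TRUE)

d <- pre - post                                        # per-subject differences
one_sample <- t.test(d, mu = 0, alternative = "two.sided")

paired$statistic; one_sample$statistic   # both are -6
paired$p.value;   one_sample$p.value     # identical p-values
```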

Linear Regression
Simple Linear Regression
Only 1 regressor

Assumptions

Data obtained by randomisation

y ~ x is linear

Check scatterplot and look at bands

Values of y stable as x increases?

ϵ ~ N(0, σ²) ⇒ This is the error term

This essentially means constant variance assumption is met

We also implicitly assume that this error term is normal

Computed using error terms:

M1 = lm(y ~ x, data = data)

summary(M1)

# The col name here must match EXACTLY in the original data i.e. same as the explanatory variable
vals_to_predict = data.frame(x = c(20, 40))
predict(M1, vals_to_predict)

# Sample data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

# Plot the data points

plot(x, y, main = "Linear Regression", xlab = "x", ylab = "y", pch = 19)

### THIS IS A BIT OF EXTRA KNOWLEDGE BUT WOULD BE GOOD FOR EXAMS
abline(M1, col = "blue", lwd = 2)

# Interval estimates
confint(M1, level = 0.95) # To get confint for all intercepts and regressors
confint(M1, 'col_name', level = 0.95) # To get confint for that regressor only
### The intercept is the mean response when x = 0
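The fit, predict and interval steps can be run together on the small sample data from these notes. The data frame name d and the new x values 6 and 7 are picked here for illustration.

```r
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2, 3, 5, 4, 6))

M1 <- lm(y ~ x, data = d)
coef(M1)                          # intercept 1.3, slope 0.9

# New data must use the same column name as the regressor
new_x <- data.frame(x = c(6, 7))
pred <- predict(M1, new_x)        # 6.7 and 7.6

confint(M1, "x", level = 0.95)    # 95% interval for the slope only
```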

t-tests for β1

Tests if a regressor or term is significant

F-tests for β1

Tests if WHOLE model is significant

Null hypothesis: All coefficients except the intercept are 0 i.e. model is insignificant

Alt hypothesis: At least one of the coefficients, except the intercept, is non-zero

In Simple linear regression model, hypothesis for F-tests == hypothesis for t-tests

This is because there is only one regressor β1

df1 = # coefficients

df2 = n - (# coefficients + 1 i.e. from the intercept)

Regression Diagnostics
Randomization: From the steps of data collection

Linearity: can check this assumption using a scatter plot between response Y
and regressor X and the residuals plot

Normality: is checked using the residuals of the built model

Constant variance: is checked using the residuals of built model

Linearity

In (2): Linearity is violated

Possible fix is to add higher order terms e.g. x²

In R, need to wrap around in I ⇒ lm(y ~ x + I(x^2), data = data)

In (3): Variance is not constant

Possible fix is to transform response var y

In R, lm(log(y) ~ x) or lm(sqrt(y) ~ x) or lm(1/y ~ x)

No need I term, log in R is automatically log e == ln

Only need these 3 transformations

Transformations will change the coefficient β1
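A sketch of the higher-order-term fix on synthetic data. The curved relationship y = 3 + 2x² plus a small alternating disturbance is invented here for illustration. Inside a formula, x^2 without I() is treated as a formula operator and reduces to x, which is why the wrapper is needed.

```r
x <- 1:10
noise <- rep(c(0.5, -0.5), times = 5)   # small deterministic disturbance
y <- 3 + 2 * x^2 + noise                # truly quadratic relationship

M_lin  <- lm(y ~ x)                     # straight line: misses the curvature
M_quad <- lm(y ~ x + I(x^2))            # I() keeps x^2 as arithmetic

summary(M_lin)$r.squared
summary(M_quad)$r.squared               # higher than the linear fit
coef(M_quad)                            # I(x^2) coefficient close to 2
```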

Residual Plots

Used to check normality, constant variance, and whether we need to transform y or add a higher order term of x

Most commonly used are Standardised Residuals or SR

# How to get SR
car = read.csv("C:/Data/[Link]")
attach(car)
M1 = lm(Selling_Price~Present_Price, data = car)
[Link] = M1$res # These are the raw residuals
SR = rstandard(M1) # These are the standard residuals

Necessary Plots

# SRs (y-axis) against y_i (on x-axis)

# Expect points to scatter randomly about 0 but be between -3 and 3
y.i = fitted(M1)
plot(SR ~ y.i)

# SR against X
# Expect points to scatter randomly about 0 but be between -3 and 3
plot(SR ~ Present_Price)

# SR histogram
# Expect to be normally distributed
hist(SR)

# QQ plot of SR
qqnorm(SR)
qqline(SR)

Cook's Distance
Outliers are identified when SR > 3 or SR < -3

We need to see whether these points should be dropped or corrected

Influential Points
This is an outlier that affects the model parameters estimates greatly

Outliers may or may not be influential

If the outlier's Cook's distance is > 1, then it is considered influential

which(SR>3|SR<(-3)) # index of outliers

65 79 83 86 87 95
65 79 83 86 87 95

# If you want the outlying rows directly

car[which(SR > 3 | SR < -3), ]

C = cooks.distance(M1)
which(C>1) # index of influential point
87
87
# Hence, may try to drop the 87th point and fit the model again
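The drop-and-refit step can be sketched on synthetic data. The eleven points below are invented for illustration: ten lie near y = 2x and one high-leverage point at x = 30 sits far below that line.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9, 20)

M  <- lm(y ~ x)
SR <- rstandard(M)            # standardised residuals
C  <- cooks.distance(M)       # Cook's distance per observation

influential <- which(C > 1)   # index of influential points
influential                   # flags the 11th observation only

# Refit without the influential point and compare the slopes
M_drop <- lm(y ~ x, subset = -influential)
coef(M)["x"]; coef(M_drop)["x"]
```

Here the slope moves from well below 1 to about 2 once the point is dropped, which is what "influential" means in practice.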

Coefficient of Determination R²
Takes on value between 0 and 1

If there are repeated x values with different y values, R² can never be 1

If it is equal to 1 ⇒ yᵢ = ŷᵢ ∀i

Multiple R sq for simple linear regression

Adj R sq for multiple linear regression
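Both quantities can be read off summary(). This is a sketch using the small sample data from the regression section; the object name s is chosen here.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 4, 6)

M1 <- lm(y ~ x)
s <- summary(M1)

s$r.squared       # multiple R-squared: 0.81
s$adj.r.squared   # adjusted R-squared: about 0.747
cor(x, y)^2       # in simple linear regression this equals R-squared
```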

No ratings yet
06 Module6 BE COMP ONLY DataAnalyticsWithR
81 pages
R Programming Basics and Data Structures
No ratings yet
R Programming Basics and Data Structures
15 pages
Data Analytics Lab Week 2 Guide
No ratings yet
Data Analytics Lab Week 2 Guide
26 pages
R Programming Basics for Beginners
No ratings yet
R Programming Basics for Beginners
53 pages
Introduction to R Programming Basics
No ratings yet
Introduction to R Programming Basics
15 pages
Network Analysis with R and igraph
No ratings yet
Network Analysis with R and igraph
62 pages
R Document Final
No ratings yet
R Document Final
47 pages
Introduction To R Programming
No ratings yet
Introduction To R Programming
24 pages
Introduction to R Programming Basics
No ratings yet
Introduction to R Programming Basics
53 pages
Introduction to R Programming Basics
No ratings yet
Introduction to R Programming Basics
101 pages
Essential R Packages and Functions Guide
No ratings yet
Essential R Packages and Functions Guide
9 pages
R Programming for Statistics & Visualization
No ratings yet
R Programming for Statistics & Visualization
19 pages
Plant Growth Analysis: ANOVA Results
No ratings yet
Plant Growth Analysis: ANOVA Results
15 pages
Dec 2016 Solution
No ratings yet
Dec 2016 Solution
7 pages
Design of Experiments Overview
No ratings yet
Design of Experiments Overview
13 pages
Wilcoxon Signed-Rank Test Guide
No ratings yet
Wilcoxon Signed-Rank Test Guide
22 pages
Hypothesis Testing and Error Types
No ratings yet
Hypothesis Testing and Error Types
12 pages
Howell's 9th Edition Stats Question Bank
No ratings yet
Howell's 9th Edition Stats Question Bank
22 pages
C.MED Research Viva Guide
No ratings yet
C.MED Research Viva Guide
16 pages
Data Science Skills and Concepts Quiz
No ratings yet
Data Science Skills and Concepts Quiz
4 pages
Sugarcane Population & Stalk CV Analysis
No ratings yet
Sugarcane Population & Stalk CV Analysis
9 pages
Mean, Median, Mode in Grouped Data
No ratings yet
Mean, Median, Mode in Grouped Data
11 pages
Understanding Random Samples and Statistics
No ratings yet
Understanding Random Samples and Statistics
3 pages
SPSS Missing Data Analysis Techniques
No ratings yet
SPSS Missing Data Analysis Techniques
92 pages
Comprehensive Data Science Course Schedule
No ratings yet
Comprehensive Data Science Course Schedule
1 page
JARS-Quant: Data Collection Guidelines
No ratings yet
JARS-Quant: Data Collection Guidelines
3 pages
Overview of Parametric Tests
100% (1)
Overview of Parametric Tests
11 pages
Sample Size Techniques for SEM Research
No ratings yet
Sample Size Techniques for SEM Research
10 pages
Kriging vs. Simulation, A 2D Map Example - GeostatsPy Well-Documented Demonstration Geostatistical Workflows
No ratings yet
Kriging vs. Simulation, A 2D Map Example - GeostatsPy Well-Documented Demonstration Geostatistical Workflows
16 pages
Research Methods for Mechanical Engineering
No ratings yet
Research Methods for Mechanical Engineering
130 pages
Quarterly Operating Expenses Analysis
No ratings yet
Quarterly Operating Expenses Analysis
25 pages
Experiment Analysis of Student Scores
No ratings yet
Experiment Analysis of Student Scores
20 pages
Hypothesis Testing in Engineering Analysis
No ratings yet
Hypothesis Testing in Engineering Analysis
36 pages
Understanding the Independent t-Test
No ratings yet
Understanding the Independent t-Test
34 pages
Analyzing Scatter Plot Associations
No ratings yet
Analyzing Scatter Plot Associations
29 pages
Statistical Quality Control Methods
No ratings yet
Statistical Quality Control Methods
2 pages
Instagram Use and Student Grades Analysis
No ratings yet
Instagram Use and Student Grades Analysis
6 pages
Stata Exercises for Microeconometrics
No ratings yet
Stata Exercises for Microeconometrics
2 pages
Understanding Statistics Basics
No ratings yet
Understanding Statistics Basics
8 pages
Statistics and Probability Q2/Q4 Guide
No ratings yet
Statistics and Probability Q2/Q4 Guide
3 pages
GDP's Impact on Employment Trends
No ratings yet
GDP's Impact on Employment Trends
14 pages
EBM Semester 1 Overview and Tips
No ratings yet
EBM Semester 1 Overview and Tips
96 pages