1/7/2026 Class:
Loading Data Packages:
- Use the function “library()” to load packages
Rendering:
- The button Render will create a document that includes content and output
- The “echo: false” option disables the printing of code, only displaying the output
Exporting:
- Render the document first
- Go to the “Files” pane and check the box next to the .html file
- In the “Files” pane, click the gear icon and choose “Export”
Syntax:
- When using chunk options such as “label, eval, message,” you must put #| at the start of
the line, otherwise it will not work
- #| message: false (This is good)
- message: false (This is not good)
- These options must also go at the top of the chunk, above any code
- #| eval:
- False tells knitr to not run the code below, only displaying the code itself and
producing no output
- True tells knitr to run the code normally (default)
- #| message:
- False suppresses any messages generated by the code run
- True tells knitr to display any messages that pop up from the code run (default)
- #| label:
- Assigns a name to the code chunk
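As a rough sketch of how these options sit together, the top of a chunk might look like the lines below. The label name is made up for illustration and ggplot2 is assumed to be installed. The #| lines only act as options inside a Quarto/knitr chunk; plain R just sees them as comments.

```r
#| label: load-packages
#| message: false
#| echo: false

# The options above come before any code in the chunk.
library(ggplot2)
```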
RC 02 Notes:
Basic Grammar of Scatterplots:
- ggplot(data = faithful, mapping = aes(x = __, y = __)) +
geom_point()
Various aesthetic attributes:
- Size
- Color
- Shape
- Position of x and y variables
Misc Notes:
- In dataframes, rows correspond to observations, and columns correspond to variables
- An argument is an input to a function
- The + sign adds a layer to a ggplot() plot; if no geometric object is added with +, the
result is an empty plot
- Alpha is the aesthetic argument that allows you to change the transparency of a
geometric object
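A filled-in version of the scatterplot template might look like this. It is only a sketch: the choice of eruptions for x and waiting for y is an assumption (they are the two columns of the built-in faithful data), and the alpha value is arbitrary.

```r
library(ggplot2)

# faithful is a built-in data frame with columns eruptions and waiting
p <- ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.3)  # alpha below 1 makes overlapping points easier to see
p
```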
1/9/26 Class Notes:
- The glimpse function gives you a basic overview of the data in a dataset
Scatterplot Syntax:
- #| label: scatter-legos
ggplot(data = legosets, mapping = aes(x = Pieces, y = USD_MSRP))+
geom_point(alpha = .5)
- To color points by a variable, use color = (var name) in the “mapping = aes()” chunk (lowercase; R is case-sensitive)
- To use jittering, use geom_point(position = "jitter")
Misc. Notes:
- A scatterplot only takes numeric vs numeric variables, and visualizes Y vs X
- Describing a scatterplot:
- Association
- Strength
- Patterns
RC 03 Notes (Linegraphs/Histograms):
filter()
- The function filter() keeps only the rows that meet a condition, filtering out the rows we don’t need
Linegraph Syntax:
- ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp))+
geom_line()
- geom_line() creates the line on the graph
Histogram Syntax:
ggplot(data = weather, mapping = aes(x = temp))+
geom_histogram(color = "white", fill = "steelblue")
- geom_histogram() creates the histogram graph
- color = “white” adds white borders around the bins
- fill = “steelblue” colors in the bars
- bins = 40 sets the number of bins
- Alternatively, we can use binwidth = 10 to set the width of each bin
- Use one or the other; if both are given, binwidth overrides bins
- REMEMBER HISTOGRAMS TAKE ONLY ONE INPUT
Histogram Notes:
- geom_histogram() uses 30 bins by default
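To see the two bin controls side by side, here is a small sketch using the built-in faithful data. The values 20 and 5 are arbitrary choices for illustration, not recommendations.

```r
library(ggplot2)

# Same data, two ways of controlling the bins
by_count <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 20, color = "white", fill = "steelblue")

by_width <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 5, color = "white", fill = "steelblue")

# layer_data() returns the computed bins, one row per bar
head(layer_data(by_width)[, c("xmin", "xmax", "count")])
```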
Faceting:
- We use faceting when we want to split a particular visualization of variables by another
variable
- facet_wrap(~variable) helps us facet our graph
- Using nrow = 4 and ncol = 4 we can specify the number of rows and columns
1/12/2026 Class Notes:
- View(), to view data, must be capitalized (R is case-sensitive)
- When using ggplot(), you can omit “data =” and “mapping =”
- R knows your first and second arguments will be these, so you can skip typing the names
- Use a linegraph when we have two numeric variables and X has a sequential order (often time)
Describing Data:
- Classify association (positive, negative, none)
- Look for patterns
- Linear/non linear/constant
Specifying Layers:
ggplot(data = yearly_legosets)+
geom_line(aes(x = Year, y = mean_usd_msrp))+
geom_point(aes(x = Year, y = mean_usd_msrp))
- When we put aes() in a local layer (inside a geom), those variables are only used by
that layer
- When we put aes() in the global layer (inside ggplot()), those variables are used by
every layer added after it
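For comparison with the local version above, a global-aes sketch could look like this. It uses the economics data that ships with ggplot2 as a stand-in, since the lego data is not available here.

```r
library(ggplot2)

# economics ships with ggplot2; date and unemploy are two of its columns
p_global <- ggplot(economics, aes(x = date, y = unemploy)) +  # global aes()
  geom_line() +   # inherits x and y
  geom_point()    # inherits x and y
```

Writing aes() once in ggplot() avoids repeating the same mapping in every geom.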
Linegraph Effects:
- Using linetype = “dashed”, we can choose what type of line we want
- Using linewidth = .5, we can change the thickness of the line (a number, not a quoted string)
Histogram Notes:
- Describing a histogram
- center (peaks or mean)
- spread (range or standard deviation)
- shape (left skewed, right skewed, symmetric) (might describe it as asymmetric if
multimodal)
- You want the reader to be able to picture your histogram from your description
Faceting Notes:
ggplot(mod_movie_lengths, aes(x = length))+
geom_histogram(color = "white", bins = 40)+
facet_wrap("before_1984", ncol = 1, scales = "free_y", strip.position = "right")
- ncol determines how the graphs are displayed (over/under vs side-by-side)
- scales = "free_y" lets each panel have its own y-axis range (e.g. one panel's y-axis only
goes up to 1000 instead of 2000)
- strip.position = “right” chooses where to put the label for each panel
RT 04 Notes (Boxplots and Bargraphs)
Boxplots:
- Same syntax as everything else for ggplot, except geom_boxplot() is used
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
- Additionally, boxplots are y by x like scatterplots
- Five number summary included in a boxplot:
- Median
- Maximum
- Minimum
- Third quartile (Q3, 75th percentile)
- First quartile (Q1, 25th percentile)
- Boxplot vocabulary:
- Whiskers: Lines extending from the box to the most extreme points within 1.5 x IQR of
the box (below the 25th percentile or above the 75th percentile)
- Length of the box: the IQR (a measure of the spread of the data)
- Outliers: Dots beyond the whiskers
- Box: 1st quartile, median, 3rd quartile, middle 50% of data
Bar Graphs:
- geom_bar() is used for categorical data that is NOT pre-counted
- geom_col() is used for categorical data that IS pre-counted
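The difference between the two is easiest to see on a tiny example. The pets data frame below is made up for illustration, and count() from dplyr is used here as one way to get pre-counted data.

```r
library(ggplot2)
library(dplyr)

# Hypothetical raw data: one row per observation
pets <- tibble(animal = c("cat", "dog", "dog", "fish", "dog", "cat"))

# Not pre-counted: geom_bar() counts the rows for us
p_bar <- ggplot(pets, aes(x = animal)) +
  geom_bar()

# Pre-counted: count() makes a column n, and geom_col() plots it as-is
pet_counts <- pets %>% count(animal)
p_col <- ggplot(pet_counts, aes(x = animal, y = n)) +
  geom_col()
```

Both plots should show the same bar heights; only the shape of the input data differs.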
1/14/2026 Notes:
Syntax Notes:
- factor() converts a variable into a categorical variable (e.g. numeric 0/1 values get treated as two categories instead of numbers)
Boxplot Notes:
- Maps side-by-side a numeric variable by a categorical variable OR a single numeric
variable
- xlim() and ylim() allow us to set the range of values the x/y axes show
- xlim(0, 1000)
- Describing a boxplot:
- Center (Median)
- Spread (IQR maybe range)
- Shape (Skew)
Stacked Barplot:
- By mapping another variable to fill (fill = variable), we can create a stacked barplot
- This must be in aes()
- By using geom_bar(position = “fill”), each bar is scaled to the same height (proportions
from 0 to 1), which helps compare proportions instead of counts
Side-by-Side Barplot:
- Putting geom_bar(position = “dodge”) creates a side-by-side barplot
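The three barplot layouts can be sketched from one shared base plot. This uses the built-in mtcars data as a stand-in, with cyl and am wrapped in factor() so they are treated as categories; the variable names are just for illustration.

```r
library(ggplot2)

# mtcars is built in; factor() makes cyl and am categorical
base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

stacked      <- base + geom_bar()                    # counts, stacked
proportions  <- base + geom_bar(position = "fill")   # each bar scaled to 1
side_by_side <- base + geom_bar(position = "dodge")  # bars next to each other
```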
RT 05
Quick Recap:
- scatterplots via geom_point()
- linegraphs via geom_line()
- boxplots via geom_boxplot()
- histograms via geom_histogram()
- barplots via geom_bar() or geom_col()
Other Functions:
- filter() filters existing rows to only pick out a subset of them
- summarize() reduces one or more columns/variables to summary statistics
- group_by() groups rows by the values of a variable
- mutate() takes existing columns/variables and creates new ones
- arrange() sorts the rows by the values of a variable
Pipe Operator:
- %>% takes the output of one line and “pipes” it in to be the input for the next line of code
alaska_flights <- flights %>%
filter(carrier == "AS")
summarize() Variables:
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
- na.rm = TRUE removes any missing “NA” values before computing
- Misc. summarize variables:
- mean(): the mean AKA the average
- sd(): the standard deviation, which is a measure of spread
- min() and max(): the minimum and maximum values respectively
- IQR(): Interquartile range
- sum(): the sum
- n(): A count of the number of rows/observations in each group
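A short sketch of several of these summary functions in one summarize() call. It uses the built-in airquality data (whose Ozone column has missing values) as a stand-in, and the column names on the left of each = are made up.

```r
library(dplyr)

# airquality is built in; its Ozone column contains NA values
ozone_summary <- airquality %>%
  summarize(
    mean_with_na = mean(Ozone),                # NA, because of missing values
    mean_ozone   = mean(Ozone, na.rm = TRUE),  # missing values dropped first
    sd_ozone     = sd(Ozone, na.rm = TRUE),
    n_rows       = n()                         # counts rows, including NA rows
  )
ozone_summary
```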
Class 1/16/26 Notes:
%>% short hand:
- Ctrl + Shift + M types out %>%
Summarize() Notes:
summarize(mean_air = mean(air_time, na.rm = TRUE))
- We can give “mean(air_time)” a column name, “mean_air”
- na.rm goes inside mean()
- Remember when using multiple summary functions, DO NOT use multiple summarize()
calls
ord_summary <- ord_flights %>%
summarize(
mean = mean(air_time, na.rm = TRUE),
sd_air = sd(air_time, na.rm = TRUE),
n = n()
)
- Notice how all the summary functions are in one summarize()
Quick Recap:
- filter : keep only observations/rows that meet a criteria
- summarize: reduce the data frame to a summary of specified calculations
Filter() Alternatives:
- !is.na() (used inside filter())
- drop_na()
- These two remove observations (rows) with missing values from the data, instead of just ignoring NAs in a calculation
ord_not_cancelled <- ord_flights %>%
drop_na(air_time)
Filter Notes:
- When using filter(), we can combine multiple criteria with and (& or ,) and or (|)
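Here is a sketch of combined criteria and of the two ways to drop missing rows. It uses the built-in mtcars and airquality data as stand-ins for the flights data, so the column names differ from the class examples.

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# and: both conditions must hold (a comma between conditions means the same)
small_efficient <- mtcars %>%
  filter(cyl == 4 & mpg > 30)

# or: at least one condition must hold
four_or_six <- mtcars %>%
  filter(cyl == 4 | cyl == 6)

# Two ways to drop rows where Ozone is missing
ozone_a <- airquality %>% filter(!is.na(Ozone))
ozone_b <- airquality %>% drop_na(Ozone)
```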
RT 06 3.4-3.9:
group_by():
- If we want to compute summary statistics based on a categorical variable instead of for
the entire dataset, we can use the group_by() function
group_by(month) %>%
- If we pipe our data into this, the dataframe looks the same but its rows are grouped by
month, so later steps like summarize() run once per month (month is a variable in the
“weather” dataframe)
- You are not limited to grouping by one variable. You can group by multiple variables
within the same group_by function
new_data <- data %>%
group_by(var1, var2)
- Remember we can also use ungroup() to ungroup things
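A sketch of group_by() feeding into summarize(), with one and then two grouping variables. The built-in mtcars data stands in for the class datasets, and .groups = "drop" is one way to get an ungrouped result without a separate ungroup() step.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                # one group per distinct value of cyl
  summarize(mean_mpg = mean(mpg),  # computed separately within each group
            n = n())

mpg_by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%            # one group per cyl/am combination
  summarize(mean_mpg = mean(mpg), .groups = "drop")  # "drop" ungroups the result
```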
Mutate():
- mutate() allows us to transform data (convert Fahrenheit into Celsius) and “create a new
variable”
weather <- weather %>%
mutate(temp_in_C = (temp-32)/1.8)
- We can also use multiple variables when we mutate:
flights <- flights %>%
mutate(gain = dep_delay - arr_delay)
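To check the direction of the temperature conversion, here is a tiny sketch on made-up Fahrenheit values (the temps data frame is hypothetical).

```r
library(dplyr)

# Hypothetical temperatures in Fahrenheit
temps <- tibble(temp = c(32, 50, 212))

temps <- temps %>%
  mutate(temp_in_C = (temp - 32) / 1.8)  # adds a column, keeps the original
temps
```

Freezing (32 F) should come out as 0 C and boiling (212 F) as 100 C.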
Arrange():
- arrange() allows us to sort/reorder a dataframe according to the values of a specified
variable
freq_dest <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights))
- arrange() sorts in ascending order by default; desc() changes the order to descending
inner_join():
- The join functions (e.g. inner_join()) allow us to merge two datasets by a shared key variable
flights_joined <- flights %>%
inner_join(airlines, by = "carrier")
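A small sketch of what inner_join() keeps and drops. Both tables below are made up for illustration; only the key column name "carrier" is borrowed from the example above.

```r
library(dplyr)

# Hypothetical tables that share the key column "carrier"
trips <- tibble(carrier = c("AA", "AS", "ZZ"), num_flights = c(10, 5, 1))
carrier_names <- tibble(carrier = c("AA", "AS"), name = c("American", "Alaska"))

# inner_join() keeps only rows whose key appears in both tables
trips_joined <- trips %>%
  inner_join(carrier_names, by = "carrier")
trips_joined
```

The "ZZ" row has no match in carrier_names, so it is dropped from the result.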
Misc. Verbs:
- select() only a subset of variables/columns
- rename() variables/columns to have new names
- Return only the top_n() values of a variable
Select():
- If we only need some of the variables, we can use select() to create a dataframe of just
some of the variables
- Remove a certain variable:
flights_no_year <- flights %>%
select(-year)
- Take a range of variables:
flight_arr_times <- flights %>%
select(month:day, arr_time:sched_arr_time)
Rename():
- Rename() renames a variable because why not
flights_time <- flights %>%
select(contains("time")) %>%
rename(departure_time = dep_time,
arrival_time = arr_time)
Slice():
- slice_max() and slice_min() return the rows with the largest and smallest values of a variable
named_dests %>%
slice_max(n = 10, order_by = num_flights)
Table I’m too lazy to create:
1/21/2026 Notes:
Review:
- filter() tests equality with == (not =)
Group_by():
- Group_by() goes before summarize()
- Group_by() also takes a categorical variable
Arrange():
- arrange() usually goes at the end of your pipeline
- Make sure to use the newly created variables (e.g. from summarize()), not the original variables
RT 07 Notes: Tidy Data
Filetypes:
- Idk there’s some notes on filetypes, don’t really get it but its there ig
Tidy Data:
- Tidy Data is data in R that follows a standardized format
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
RC 08 Basic Regression:
Variable Relationships:
- Y is an outcome variable, also called a dependent variable
- X is an explanatory/predictor/independent variable
Packages Needed:
- ggplot2 for data visualization
- dplyr for data wrangling
- tidyr for converting data to “tidy” format
- readr for importing spreadsheet data into R
- skimr for computing summary statistics
Skim():
- skim() returns univariate summary statistics, or functions that take a single variable and
return some numerical summary of that variable
- The correlation coefficient is a number between -1 and 1 that represents the strength of
the linear relationship between two numerical variables
Simple linear regression:
1/26/2026
Loading in Data:
state_sat <- read_csv("data/state_sat.csv")
skim():
- skim()/skim_without_charts() (from the skimr package) allows us to see a summary of the data
Describing a Dataset:
- Missingness of Variables
- Number of region levels/categories
- Number of division levels/categories
- Is the data reasonable?
Geom_smooth():
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
geom_smooth(method = "lm", se = FALSE)
- Adds a line of best fit to the data
Finding Correlation:
summarize(correlation = cor(score, bty_avg))
OR
state_sat %>%
select(teach_pay, sat_math) %>%
cor()
- Note that cor() on a dataframe outputs a correlation matrix instead of a single number
Coefficient Interpretations:
- b0 (intercept): The expected/predicted value of "y" when "x" is equal to 0.
- b1 (slope): For every 1 unit increase in "x", we predict "y" to increase/decrease on
average by "b1".
Fit a Model:
model_math_pay <- lm(sat_math ~ teach_pay , data = state_sat)
summary(model_math_pay)
Using Residuals/fitted:
state_sat2 <- state_sat %>%
select(state, teach_pay, sat_math) %>%
mutate(
sat_math_hat = fitted(model_math_pay),
residual = residuals(model_math_pay)
) %>% filter(state == "IL" | state == "FL")
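Since state_sat.csv is not available here, the same fit/fitted/residuals steps can be sketched on the built-in mtcars data. The model name and variables below are stand-ins, not the class model.

```r
# Base R only; mtcars is built in
model_mpg_wt <- lm(mpg ~ wt, data = mtcars)

coef(model_mpg_wt)  # b0 (intercept) and b1 (slope for wt)

mpg_hat  <- fitted(model_mpg_wt)     # predicted mpg for each car
residual <- residuals(model_mpg_wt)  # observed mpg minus predicted mpg

# Observed = fitted + residual for every row
all.equal(unname(mpg_hat + residual), mtcars$mpg)
```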
RT 10 6.0-6.1:
Interaction Model:
- A way to quantify the relationship between the outcome variable and two explanatory
variables, where each group gets its own intercept and its own slope
- Syntax:
ggplot(state_sat, aes(x = pct_taking, y = sat_math, color = main_exam))+
geom_point() +
geom_smooth(method ="lm", se = FALSE)
Parallel Slopes Model:
- A parallel slopes model allows for different intercepts but forces all lines to have the
same slope, creating parallel lines
- Syntax:
ggplot(state_sat, aes(x = pct_taking, y = sat_math, color = main_exam))+
geom_point()+
geom_parallel_slopes(se=FALSE)
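The plots above only draw the two kinds of lines. Assuming the models themselves are fit with lm() the same way as the simple regression, the difference would come down to * versus + in the formula. The sketch below uses the built-in mtcars data as a stand-in, with am treated as the categorical variable.

```r
# Base R only; am (0/1) is treated as a category via factor()
model_interaction <- lm(mpg ~ wt * factor(am), data = mtcars)
model_parallel    <- lm(mpg ~ wt + factor(am), data = mtcars)

# Interaction: intercept, slope, intercept offset, slope offset
names(coef(model_interaction))

# Parallel slopes: intercept, slope, intercept offset (one shared slope)
names(coef(model_parallel))
```

The interaction model has one extra coefficient, which is what lets the second group's slope differ from the first.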
RT 11 2/1/2026:
For the love of god please remember:
- fitted() gives you the fitted values of a model
- residuals() gives you the residuals of a model
Syntax:
debt_model_data <- credit_ch6 %>%
select(debt, credit_limit, income) %>%
mutate(debt_hat = fitted(debt_model),
residual = residuals(debt_model)) %>%
rownames_to_column("ID")