Stat 202-0 Notes

The document provides detailed notes on R programming, specifically focusing on data visualization and manipulation using ggplot2 and dplyr. It covers various plotting techniques such as scatterplots, line graphs, histograms, boxplots, and bar graphs, along with their respective syntax and options. Additionally, it includes functions for data wrangling, summarizing, and regression analysis, emphasizing the importance of tidy data and the interpretation of statistical models.

1/7/2026 Class:

Loading Data Packages:


-​ Use the function “library()” to load packages
Rendering:
-​ The button Render will create a document that includes content and output
-​ The “echo: false” option disables the printing of code, only displaying the output
Exporting:
-​ Render the document first
- Go to the “Files” pane and check the box next to the .html file
- In the “Files” pane, click the gear icon and choose “Export”
Syntax:
- The chunk options “label, eval, message” must be written with #| at the start of the line, otherwise they will not work
- #| message: false (This is good)
- message: false (This is not good)
- They also must be put at the top of the chunk, above any code
-​ #| eval:
-​ False tells knitr to not run the code below, only displaying the code itself and
producing no output
-​ True tells knitr to run the code normally (default)
-​ #| message:
-​ False suppresses any messages generated by the code run
-​ True tells knitr to display any messages that pop up from the code run (default)
-​ #| label:
- Assigns a name to the code chunk
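As a rough illustration of the options above, this is what the top of a chunk might look like. The label load-packages is a made-up name and ggplot2 is assumed to be installed. Because each option starts with #|, R itself treats those lines as comments and only the renderer reads them.

```r
#| label: load-packages
#| message: false
#| eval: true

# Chunk options sit above the code; R sees them as ordinary comments.
library(ggplot2)
```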

RC 02 Notes:
Basic Grammar of Scatterplots:
-​ ggplot(data = faithful, mapping = aes(x = __, y = __)) +
geom_point()
Various aesthetic attributes:
-​ Size
-​ Color
-​ Shape
-​ Position of x and y variables
Misc Notes:
-​ In dataframes, rows correspond to observations, and columns correspond to variables
-​ Argument refers to input to a function
-​ The + sign in ggplot() adds a layer to the plot, and not using the + sign to add a
geometric object will result in an empty plot
-​ Alpha is the aesthetic argument that allows you to change the transparency of a
geometric object
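A minimal sketch of the scatterplot grammar with the blanks filled in. It assumes ggplot2 is installed and uses the two columns of the built-in faithful dataset (eruptions and waiting); the choice of which goes on x and which on y is only for illustration.

```r
library(ggplot2)

# faithful ships with R and has two numeric columns: eruptions and waiting
p <- ggplot(data = faithful, mapping = aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.5)  # alpha below 1 makes overlapping points easier to see

p  # printing the object draws the plot
```

Leaving off the + geom_point() layer would draw the axes but no points.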
1/9/26 Class Notes:
-​ The glimpse function gives you a basic overview of the data in a dataset
Scatterplot Syntax:
-​ #| label: scatter-legos
ggplot(data = legosets, mapping = aes(x = Pieces, y = USD_MSRP))+
geom_point(alpha = .5)
- To alter color, use color = (var name) in the “mapping = aes()” chunk (lowercase, since R is case-sensitive)
-​ To use jittering, use geom_point(position = "jitter")
Misc. Notes:
-​ A scatterplot only takes numeric vs numeric variables, and visualizes Y vs X
-​ Describing a scatterplot:
-​ Association
-​ Strength
-​ Patterns
RC 03 Notes (Linegraphs/Histograms):
Filter()
- The function filter() keeps only the rows that meet a condition, dropping the rows that we don’t need
Linegraph Syntax:
-​ ggplot(data = early_january_weather, mapping = aes(x = time_hour, y = temp))+
​ geom_line()
-​ geom_line() creates the line on the graph
Histogram Syntax:
ggplot(data = weather, mapping = aes(x = temp))+
geom_histogram(color = "white", fill = "steelblue")
-​ geom_histogram() creates the histogram graph
- color = "white" adds white borders around the bins
- fill = "steelblue" colors in the bars
- bins = 40 sets the number of bins
- Alternatively, we can use binwidth = 10 to set the width of each bin
- Use one or the other; if both are given, binwidth overrides bins
-​ REMEMBER HISTOGRAMS TAKE ONLY ONE INPUT
Histogram Notes:
- ggplot2 uses 30 bins by default
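A short sketch of the two ways to control binning, assuming ggplot2 is installed. The built-in faithful dataset stands in for the weather data, and the specific values 40 and 10 are just the ones mentioned in these notes.

```r
library(ggplot2)

# Control the number of bins...
by_count <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 40, color = "white", fill = "steelblue")

# ...or control the width of each bin (in the units of the x variable)
by_width <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(binwidth = 10, color = "white", fill = "steelblue")

by_count
by_width
```

Either way every observation lands in exactly one bin; only the grouping changes.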
Faceting:
-​ We use faceting when we want to split a particular visualization of variables by another
variable
-​ facet_wrap(~variable) helps us facet our graph
-​ Using nrow = 4 and ncol = 4 we can specify the number of rows and columns
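A hedged example of faceting, assuming ggplot2 is installed. The built-in mtcars dataset is a stand-in here: mpg is the variable being plotted and cyl (4, 6, or 8 cylinders) is the variable used to split it.

```r
library(ggplot2)

# One histogram panel per value of cyl, stacked in a single column
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, color = "white") +
  facet_wrap(~cyl, ncol = 1)

p
```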
1/12/2026 Class Notes:
-​ View(), to view data, must be capitalized (for some reason)
- When using ggplot(), so far you have typed data = and mapping =
- R knows your first and second arguments will be these, so you can leave the argument names out
- Linegraphs are for when we have two numeric variables and X has a sequential order (often time)
Describing Data:
- Classify association (positive, negative, none)
-​ Look for patterns
-​ Linear/non linear/constant
Specifying Layers:
ggplot(data = yearly_legosets)+
geom_line(aes(x = Year, y = mean_usd_msrp))+
geom_point(aes(x = Year, y = mean_usd_msrp))
-​ When we put the aes line in the local layers, those variables are only used by that line of
code
-​ When we put the aes line in the global layers, those variables are used by every line of
code thereafter
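To make the global vs. local distinction concrete, here is a small sketch assuming ggplot2 is installed. The built-in pressure dataset (columns temperature and pressure) is a stand-in for yearly_legosets.

```r
library(ggplot2)

# Global aes(): written once inside ggplot(), inherited by both layers
global_version <- ggplot(pressure, aes(x = temperature, y = pressure)) +
  geom_line() +
  geom_point()

# Local aes(): each layer is given its own mapping
local_version <- ggplot(pressure) +
  geom_line(aes(x = temperature, y = pressure)) +
  geom_point(aes(x = temperature, y = pressure))
```

Both versions draw the same plot; the global form just avoids repeating the mapping.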
Linegraph Effects:
- Using linetype = "dashed", we can choose what type of line we want
- Using linewidth = 0.5 (a number, not a quoted string), we can change the thickness of the line
Histogram Notes:
-​ Describing a histogram
-​ center (peaks or mean)
-​ spread (range or standard deviation)
- shape (left skewed, right skewed, symmetric) (might use “asymmetric” if multimodal)
- You want the reader to be able to picture your graph from your description of the histogram
Faceting Notes:
ggplot(mod_movie_lengths, aes(x = length))+
geom_histogram(color = "white", bins = 40)+
facet_wrap("before_1984", ncol = 1, scales = "free_y", strip.position = "right")
- ncol determines how the graphs are displayed (over/under vs. side-by-side)
- scales = "free_y" lets each panel have its own y-axis range (e.g., one panel’s y-axis only goes up to 1000 instead of 2000)
- strip.position = "right" chooses where to put the label for each panel
RT 04 Notes (Boxplots and Bargraphs)
Boxplots:
-​ Same syntax as everything else for ggplot, except geom_boxplot() is used
ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) +
geom_boxplot()
-​ Additionally, boxplots are y by x like scatterplots
-​ Five number summary included in a boxplot:
-​ Median
-​ Maximum
-​ Minimum
- Third quartile (Q3, 75th percentile)
- First quartile (Q1, 25th percentile)
- Boxplot vocabulary:
- Whiskers: Lines extending from the box out to the most extreme points that are not outliers (within 1.5 × IQR below Q1 or above Q3)
- Length of the box: IQR (a measure of the spread of the data)
- Outliers: Dots beyond the whiskers
- Box: 1st quartile, median, 3rd quartile, middle 50% of data
Bar Graphs:
-​ geom_bar() is used for categorical data that is NOT pre-counted
-​ geom_col() is used for categorical data that IS pre-counted
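A sketch of the difference between the two, assuming ggplot2 and dplyr are installed. The built-in mtcars dataset is a stand-in, and the pre-counted table is made here with dplyr's count(), which stores the counts in a column named n.

```r
library(ggplot2)
library(dplyr)

# Not pre-counted: one row per car, so geom_bar() counts the rows itself
raw_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Pre-counted: one row per category, with the count already in column n
cyl_counts <- mtcars %>% count(cyl)
counted_plot <- ggplot(cyl_counts, aes(x = factor(cyl), y = n)) +
  geom_col()
```

The two plots show the same bar heights; the only difference is who did the counting.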

1/14/2026 Notes:
Syntax Notes:
- factor() converts a variable into a categorical variable (a factor), so numeric codes such as month numbers or 0/1 values are treated as categories instead of numbers
Boxplot Notes:
-​ Maps side-by-side a numeric variable by a categorical variable OR a single numeric
variable
- xlim() and ylim() allow us to set the range of values shown on the x/y axes
- xlim(0, 1000)
-​ Describing a boxplot:
-​ Center (Median)
-​ Spread (IQR maybe range)
-​ Shape (Skew)
Stacked Barplot:
- By mapping another variable with fill = variable, we can create a stacked barplot
- This must be in aes()
- By putting geom_bar(position = "fill"), every bar is stretched to the same height, so we compare proportions (0 to 1) instead of counts
Side-by-Side Barplot:
-​ Putting geom_bar(position = “dodge”) creates a side-by-side barplot
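The three barplot variants side by side, as a sketch assuming ggplot2 is installed. mtcars is a stand-in dataset, with cyl on the x-axis and am (0/1 transmission type) as the fill variable.

```r
library(ggplot2)

# fill inside aes() splits each bar by a second categorical variable
base <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))

stacked      <- base + geom_bar()                    # default: counts stacked
proportions  <- base + geom_bar(position = "fill")   # each bar rescaled to total 1
side_by_side <- base + geom_bar(position = "dodge")  # groups drawn next to each other
```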

RT 05
Quick Recap:
-​ scatterplots via geom_point()
-​ linegraphs via geom_line()
-​ boxplots via geom_boxplot()
-​ histograms via geom_histogram()
-​ barplots via geom_bar() or geom_col()
Other Functions:
-​ filter() filters existing rows to only pick out a subset of them
-​ summarize() summarizes one of its columns/variables with a summary statistic
- group_by() groups rows by the values of one or more variables
- mutate() takes existing columns/variables and creates new ones
- arrange() sorts the rows by the values of a variable
Pipe Operator:
- %>% takes the output of one line and “pipes” it in to be the input for the next line of code
alaska_flights <- flights %>%
filter(carrier == "AS")
summarize() Variables:
summary_temp <- weather %>%
summarize(mean = mean(temp, na.rm = TRUE),
std_dev = sd(temp, na.rm = TRUE))
- na.rm = TRUE removes any missing “NA” values before the statistic is computed
-​ Misc. summarize variables:
-​ mean(): the mean AKA the average
-​ sd(): the standard deviation, which is a measure of spread
-​ min() and max(): the minimum and maximum values respectively
-​ IQR(): Interquartile range
-​ sum(): the sum
-​ n(): A count of the number of rows/observations in each group
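A runnable sketch of summarize() with missing values, assuming dplyr is installed. The built-in airquality dataset is a stand-in for weather; its Ozone column contains NAs.

```r
library(dplyr)

ozone_summary <- airquality %>%
  summarize(
    mean_ozone = mean(Ozone, na.rm = TRUE),  # drop NAs before averaging
    sd_ozone   = sd(Ozone, na.rm = TRUE),
    n_rows     = n()                         # counts every row, NAs included
  )

# Without na.rm = TRUE a single NA makes the whole result NA
no_na_rm <- airquality %>% summarize(mean_ozone = mean(Ozone))
```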
Class 1/16/26 Notes:
%>% short hand:
-​ Ctrl + Shift + M types out %>%
Summarize() Notes:
summarize(mean_air = mean(air_time, na.rm = TRUE))
- We can give the result of “mean(air_time)” a name, “mean_air”
- na.rm goes inside mean()
- Remember when using multiple summary functions, DO NOT use multiple summarize() calls
ord_summary <- ord_flights %>%
summarize(
mean = mean(air_time, na.rm = TRUE),
sd_air = sd(air_time, na.rm = TRUE),
n = n()
)
- Notice how all the summary functions are in one summarize()
Quick Recap:
-​ filter : keep only observations/rows that meet a criteria
-​ summarize: reduce the data frame to a summary of specified calculations
Filter() Alternatives:
- !is.na() (used inside filter())
- drop_na()
- These two remove the observations (rows) with missing values, instead of just ignoring the NAs inside a summary function
ord_not_cancelled <- ord_flights %>%
drop_na(air_time)
Filter Notes:
- When using filter(), we can combine multiple criteria with & or a comma (and) and with | (or)
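A sketch of combining filter criteria and of the two ways to drop rows with missing values. It assumes dplyr and tidyr are installed and uses the built-in airquality dataset as a stand-in for the flights data.

```r
library(dplyr)
library(tidyr)

both_and   <- airquality %>% filter(Month == 5 & Temp > 70)   # AND
both_comma <- airquality %>% filter(Month == 5, Temp > 70)    # a comma also means AND
either     <- airquality %>% filter(Month == 5 | Month == 6)  # OR

# Two ways to drop the rows where Ozone is missing
no_na_a <- airquality %>% filter(!is.na(Ozone))
no_na_b <- airquality %>% drop_na(Ozone)
```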
RT 06 3.4-3.9:
group_by():
-​ If we want to compute summary statistics based on a categorical variable instead of for
the entire dataset, we can use the group_by() function
weather %>%
  group_by(month)

- If we pipe our data into this, it will create a grouped dataframe: the rows are not reordered, but a following summarize() is computed separately for each month
(month is a variable in the “weather” dataframe)
- You are not limited to grouping by one variable. You can group by multiple variables
within the same group_by function
new_data <- data %>%
  group_by(var1, var2)
- Remember we can also use ungroup() to ungroup things
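A complete grouped summary, sketched with the built-in airquality dataset as a stand-in for weather (dplyr assumed installed). The column names mean_temp and n are made up for this example.

```r
library(dplyr)

monthly_temp <- airquality %>%
  group_by(Month) %>%                # one group per month
  summarize(mean_temp = mean(Temp),  # computed within each group
            n = n()) %>%
  arrange(desc(mean_temp))           # hottest month first

monthly_temp
```

The result has one row per month rather than one row for the whole dataset.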
Mutate():
-​ Mutate() allows us to change data (convert celsius into fahrenheit) and “create a new
variable”
weather <- weather %>%
mutate(temp_in_C = (temp-32)/1.8)
-​ We can also use multiple variables when we mutate:
flights <- flights %>%
mutate(gain = dep_delay - arr_delay)
Arrange():
- arrange() allows us to sort/reorder a dataframe according to the values of a specified
variable
freq_dest <- flights %>%
group_by(dest) %>%
summarize(num_flights = n()) %>%
arrange(desc(num_flights))
- By default arrange() sorts in ascending order; desc() changes the order to descending
Join():
- The join functions, such as inner_join(), allow us to merge two datasets by matching rows on a shared key variable
flights_joined <- flights %>%
inner_join(airlines, by = "carrier")
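To see what an inner join keeps and drops, here is a toy sketch assuming dplyr is installed. Both tables and all of their values are invented for illustration, including the carrier code "ZZ" that has no match.

```r
library(dplyr)

# Hypothetical mini-tables (made-up values)
trips <- tibble(carrier  = c("AS", "UA", "AS", "ZZ"),
                distance = c(100, 250, 300, 80))
carrier_names <- tibble(carrier = c("AS", "UA"),
                        name    = c("Airline A", "Airline B"))

# Rows are matched on the shared key column "carrier"
trips_joined <- trips %>%
  inner_join(carrier_names, by = "carrier")

trips_joined  # the "ZZ" row is dropped because it has no match
```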
Misc. Verbs:
-​ select() only a subset of variables/columns
-​ rename() variables/columns to have new names
-​ Return only the top_n() values of a variable
Select():
-​ If we only need some of the variables, we can use select() to create a dataframe of just
some of the variables
-​ Remove a certain variable:
flights_no_year <- flights %>%
select(-year)
-​ Take a range of variables:
flight_arr_times <- flights %>%
select(month:day, arr_time:sched_arr_time)
Rename():
-​ Rename() renames a variable because why not
flights_time <- flights %>%
​ select(contains("time")) %>%
rename(departure_time = dep_time,
arrival_time = arr_time)
Slice():
- slice_max() and slice_min() return the rows with the largest and smallest values of a variable
named_dests %>%
slice_max(n = 10, order_by = num_flights)
Table I’m too lazy to create:

1/21/2026 Notes:
Review:
- Inside filter(), test equality with == (not a single =)
Group_by():
- group_by() goes before summarize()
- group_by() also takes a categorical variable
Arrange():
- arrange() usually goes at the end of your pipeline
- Make sure to sort by the newly created summary column, not by an original variable that no longer exists after summarize()
RT 07 Notes: Tidy Data
Filetypes:
-​ Idk there’s some notes on filetypes, don’t really get it but its there ig
Tidy Data:
-​ Tidy Data is data in R that follows a standardized format
-​ Each variable forms a column
-​ Each observation forms a row
-​ Each type of observational unit forms a table
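A small sketch of turning a non-tidy table into a tidy one, assuming tidyr and dplyr are installed. The wide table and its values are invented; pivot_longer() is one common way to do this, not necessarily the one used in class.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: the variable "year" is hidden in the column headers
wide <- tibble(country = c("A", "B"),
               `2019`  = c(10, 20),
               `2020`  = c(11, 22))

# Tidy version: country, year, and value each get their own column
tidy <- wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "value")

tidy
```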
RC 08 Basic Regression:
Variable Relationships:
-​ Y is an outcome variable, also called a dependent variable
-​ X is an explanatory/predictor/independent variable
Packages Needed:
- ggplot2 for data visualization
- dplyr for data wrangling
- tidyr for converting data to “tidy” format
- readr for importing spreadsheet data into R
- skimr for computing summary statistics
Skim():
- skim() returns univariate summary statistics, i.e. the results of functions that take a single variable and
return some numerical summary of that variable
- The correlation coefficient is a number between -1 and 1 that represents the strength of the linear relationship between
two numerical variables
Simple linear regression:

1/26/2026
Loading in Data:
state_sat <- read_csv("data/state_sat.csv")
skim():
- skim()/skim_without_charts() allows us to see a summary of the data
Describing a Dataset:
-​ Missingness of Variables
-​ Number of region levels/categories
-​ Number of division levels/categories
-​ Is the data reasonable?
Geom_smooth():
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
geom_smooth(method = "lm", se = FALSE)
-​ Adds a line of best fit to the data
Finding Correlation:
summarize(correlation = cor(score, bty_avg))
OR
state_sat %>%
select(teach_pay, sat_math) %>%
cor()
- When given a whole dataframe, cor() outputs a correlation matrix instead of a single number
Coefficient Interpretations:
-​ b0 (intercept): The expected/predicted value of "y" when "x" is equal to 0.
- b1 (slope): For every 1 unit increase in "x", we predict "y" to increase/decrease on
average by "b1".
Fit a Model:
model_math_pay <- lm(sat_math ~ teach_pay , data = state_sat)
summary(model_math_pay)
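A sketch that ties the coefficient interpretations to fitted values and residuals. It uses only base R and the built-in cars dataset (speed and dist) as a stand-in for state_sat; the names b0, b1, and by_hand are made up here.

```r
# cars ships with R: speed and stopping distance (dist)
model <- lm(dist ~ speed, data = cars)

b0 <- coef(model)[["(Intercept)"]]  # predicted dist when speed = 0
b1 <- coef(model)[["speed"]]        # predicted change in dist per 1 unit of speed

# A fitted value is b0 + b1 * x; a residual is observed minus fitted
by_hand <- b0 + b1 * cars$speed
all.equal(by_hand, unname(fitted(model)))
all.equal(cars$dist - by_hand, unname(residuals(model)))
```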
Using Residuals/fitted:
state_sat2 <- state_sat %>%
select(state, teach_pay, sat_math) %>%
mutate(
sat_math_hat = fitted(model_math_pay),
residual = residuals(model_math_pay)
) %>% filter(state == "IL" | state == "FL")
RT 10 6.0-6.1:
Interaction Model:
- A way to quantify the relationship between the outcome variable and two explanatory
variables, where each group gets its own intercept and its own slope

-​ Syntax:
ggplot(state_sat, aes(x = pct_taking, y = sat_math, color = main_exam))+
geom_point() +
geom_smooth(method ="lm", se = FALSE)
Parallel Slopes Model:
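- A parallel slopes model allows for different intercepts but forces all lines to have the
same slope, creating parallel lines
- geom_parallel_slopes() comes from the moderndive package, so load it with library(moderndive) first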
-​ Syntax:
ggplot(state_sat, aes(x = pct_taking, y = sat_math, color = main_exam))+
geom_point()+
geom_parallel_slopes(se=FALSE)
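The plots above have matching lm() formulas. The sketch below uses base R and the built-in mtcars dataset as a stand-in (wt as the numeric variable, am as the categorical one); the model names are invented for the example.

```r
# Treat am (0/1) as a categorical variable
cars2 <- transform(mtcars, am = factor(am))

interaction_model <- lm(mpg ~ wt * am, data = cars2)  # own intercept and own slope per group
parallel_model    <- lm(mpg ~ wt + am, data = cars2)  # own intercept, one shared slope

coef(interaction_model)  # 4 terms: intercept, wt, am1, wt:am1
coef(parallel_model)     # 3 terms: intercept, wt, am1
```

The extra wt:am1 term in the interaction model is what lets the second group's slope differ.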
RT 11 2/1/2026:
For the love of god please remember:
-​ fitted() gives you the fitted values of a model
-​ residuals() gives you the residuals of a model
Syntax:
debt_model_data <- credit_ch6 %>%
select(debt, credit_limit, income) %>%
mutate(debt_hat = fitted(debt_model),
residual = residuals(debt_model)) %>%
rownames_to_column("ID")
