R Programming Lecture Overview
Lecture Review
Dr. Kalyan N
Assistant Professor
Dept. of CSE (Data Science)
B.M.S College of Engineering
Bengaluru - 560019.
October, 2024.
Programming With R
Contents
Module - 1
Lab Programs
• Step 1: Install R
– Go to the official R website: [Link]
– Choose your operating system (Windows, macOS, or Linux).
– Download the appropriate R installer.
– Run the installer and follow the instructions.
• Step 2: Install RStudio
– Visit the RStudio website: [Link]
– Download the RStudio installer for your operating system.
– Run the installer and follow the prompts to complete the installation.
1.2 Write and execute your first R script that includes basic arithmetic operations,
variable assignments, and printing results.
Once R and RStudio are installed, you can write your first R script. Below is a simple R script that includes basic
arithmetic operations, variable assignments, and printing results:
# Basic arithmetic operations
sum <- 10 + 5
difference <- 10 - 5
product <- 10 * 5
quotient <- 10 / 5

# Variable assignments
x <- 25
y <- 5
z <- x + y       # assumption: the definition of z is missing from the source listing
result <- x / y  # division of x by y

# Printing results
print(sum)
print(difference)
print(product)
print(quotient)
print(z)
print(result)
Listing 1: Simple R Script
1.3 Document the steps to install R and RStudio and describe the purpose of each
line of your script
Explanation of the Script
• # Basic Arithmetic Operations: This is a comment that explains what the following lines of code will do.
In R, comments begin with the # symbol.
• sum <- 10 + 5: This line performs the addition of two numbers (10 and 5) and stores the result in the variable
sum.
• difference <- 10 - 5: This line subtracts 5 from 10 and stores the result in the variable difference.
• product <- 10 * 5: This line multiplies 10 and 5 and stores the result in the variable product.
• quotient <- 10 / 5: This line divides 10 by 5 and stores the result in the variable quotient.
• result <- x / y: This line divides x by y, and stores the result in result.
• print(sum): This line prints the value of the variable sum.
• print(difference): This line prints the value of the variable difference.
• print(product): This line prints the value of the variable product.
This script demonstrates how to perform basic arithmetic operations, assign values to variables, and print the
results in R.
2.1 Vectors
A vector is the simplest data structure in R. It is a one-dimensional array that contains elements of the same type,
such as numeric, character, or logical values. Vectors are used for storing homogeneous data in a sequential manner.
Example:
# Numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Character vector
char_vector <- c("apple", "banana", "cherry")

# Logical vector
logical_vector <- c(TRUE, FALSE, TRUE)
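Because a vector holds a single type, R coerces mixed inputs to one common type. The short sketch below illustrates this; the variable names are illustrative and not part of the lecture script.

```r
# Mixing types in c() forces coercion to one common type
mixed <- c(1, "apple", TRUE)
print(class(mixed))              # every element is stored as a character string

# Logical values become 1 (TRUE) or 0 (FALSE) when combined with numbers
num_from_logical <- c(10, TRUE, FALSE)
print(class(num_from_logical))
print(sum(num_from_logical))     # adds 10, 1 and 0
```

Checking class() after building a vector is a quick way to catch unintended coercion.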
2.2 Matrices
A matrix is a two-dimensional data structure in R where each element must be of the same type. Matrices are
essentially vectors organized into rows and columns. They are often used for mathematical operations such as matrix
multiplication and linear algebra computations.
Example:
# Create a 3x3 numeric matrix
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)

# Matrix operations
element <- matrix_data[2, 3]         # Access element in row 2, column 3
transpose_matrix <- t(matrix_data)   # Transpose the matrix
Listing 3: Creating and manipulating matrices in R
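Since the text mentions matrix multiplication, the following minimal sketch contrasts element-wise multiplication (*) with the matrix product (%*%). The matrices A and I2 are illustrative names, not taken from the lecture.

```r
# matrix() fills column by column, so the columns of A are (1, 2) and (3, 4)
A <- matrix(1:4, nrow = 2)
I2 <- diag(2)              # 2x2 identity matrix

elementwise <- A * A       # multiplies matching entries
matprod <- A %*% A         # row-by-column matrix product

print(elementwise)
print(matprod)
print(A %*% I2)            # multiplying by the identity returns A
```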
2.3 Lists
A list is a versatile data structure that can contain elements of different types, such as numbers, strings, vectors,
and even other lists. Lists are used to store heterogeneous data, making them highly flexible for complex data
manipulation tasks.
Example:
# Create a list with different types of elements
my_list <- list(
  name = "John Doe",
  age = 30,
  scores = c(88, 92, 85),
  passed = TRUE
)
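A brief sketch of how list elements are typically accessed, assuming the same list as above. The $ and [[ ]] operators return the element itself, while single brackets return a smaller list.

```r
my_list <- list(
  name = "John Doe",
  age = 30,
  scores = c(88, 92, 85),
  passed = TRUE
)

print(my_list$name)              # access by name with $
print(my_list[["scores"]][2])    # second entry of the scores vector
sub_list <- my_list["age"]       # single brackets keep the list wrapper
print(class(sub_list))
print(mean(my_list$scores))      # functions apply to the extracted vector
```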
2.4.1 Conclusion
The fundamental data types in R (vectors, matrices, lists, and data frames) are essential for storing and
manipulating various forms of data. Each data type serves a specific purpose, from handling simple sequences of
numbers (vectors) to complex datasets with multiple variables (data frames). Learning how to create and manipulate
these data types is key to effective data analysis in R.
2.5 Design an R program to create and manipulate vectors, matrices, lists, and data
frames. Include operations such as indexing, subsetting, and applying functions
like sum(), mean(), and length().
The following R script demonstrates the creation and manipulation of vectors, matrices, lists, and data frames,
including basic operations such as indexing, subsetting, and applying functions like sum(), mean(), and length():
# Creating a vector
v <- c(10, 20, 30, 40)

# Creating a matrix
m <- matrix(1:9, nrow = 3, ncol = 3)

# Creating a list
my_list <- list(name = "Alice", age = 25, scores = c(85, 90, 95))
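The indexing, subsetting, and function calls named in the exercise can be sketched as follows. This is one possible illustration using the objects created above, not necessarily the steps of the original listing.

```r
v <- c(10, 20, 30, 40)
m <- matrix(1:9, nrow = 3, ncol = 3)
my_list <- list(name = "Alice", age = 25, scores = c(85, 90, 95))

# Indexing and subsetting
second <- v[2]                                        # single element by position
big <- v[v > 15]                                      # logical subsetting
row2 <- m[2, ]                                        # entire second row
top_scores <- my_list$scores[my_list$scores >= 90]    # subset inside a list element

# Applying functions
total <- sum(v)
avg <- mean(v)
n <- length(v)
col_sums <- colSums(m)

print(list(second = second, big = big, row2 = row2, top_scores = top_scores))
print(c(total = total, avg = avg, n = n))
print(col_sums)
```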
2.6 Create a data frame from scratch, perform basic operations, and describe the
structure and type of each element in the data frame.
A data frame is a table-like structure in R, where each column contains values of one variable and each row contains
values for each observation. Below is an R program that creates a data frame from scratch and performs basic
operations:
# Creating a data frame from scratch
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7)
)
• m <- matrix(1:9, nrow=3, ncol=3): This creates a 3x3 matrix with elements from 1 to 9.
• element <- m[2, 3]: This accesses the element in the second row and third column of matrix m.
• my_list <- list(): This creates a list with elements of different data types (string, numeric, and vector).
• df <- data.frame(): This creates a data frame with three columns: Name, Age, and Score.
• students_df <- data.frame(): This creates a data frame with student information, including their name,
age, and GPA.
• str(students_df): This function displays the structure of the data frame.
• class(): This function is used to determine the data type of each element in the data frame (e.g., character,
numeric).
The script demonstrates how to work with various data types in R, such as vectors, matrices, lists, and data
frames. It also showcases key operations like indexing, subsetting, and applying functions such as sum(), mean(),
and length().
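A minimal sketch of the data frame operations described above, assuming the students_df definition from Section 2.6. The Honors column and the age threshold are hypothetical additions for illustration.

```r
students_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 24),
  GPA = c(3.8, 3.9, 3.7)
)

str(students_df)                            # structure: rows, columns, column types
col_types <- sapply(students_df, class)     # class of each column
print(col_types)

older <- students_df[students_df$Age > 23, ]      # filter rows by condition
mean_gpa <- mean(students_df$GPA)                 # summarize one column
students_df$Honors <- students_df$GPA >= 3.8      # add a derived logical column

print(older)
print(mean_gpa)
print(students_df)
```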
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where $x_i$ represents each individual data point, and $n$ is the total number of observations. The mean is
particularly sensitive to outliers, as extreme values can significantly affect it.
Relevance: The mean provides a quick measure of the central tendency, giving an idea of the "average" value
of the dataset. In the case of sepal lengths from the Iris dataset, the mean provides an overall sense of the
typical sepal length.
# Plotting Histogram for Sepal Length
hist(iris$Sepal.Length, main = "Histogram of Sepal Length", xlab = "Sepal Length", col = "lightblue")
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 2)
[Figure: Histogram of Sepal Length (x-axis: Sepal Length; y-axis: Count)]
• Median:
The median is the middle value when data is sorted in ascending or descending order. For a dataset with an
odd number of observations, it is the central value, and for an even number of observations, it is the average
of the two middle values:
$$\text{Median} = \begin{cases} x_{(n+1)/2} & \text{if } n \text{ is odd} \\ \dfrac{x_{n/2} + x_{n/2+1}}{2} & \text{if } n \text{ is even} \end{cases}$$
Relevance: The median is robust against outliers and skewed data, providing a better central tendency
measure when the data distribution is not symmetric. In the Iris dataset, if we suspect the presence of extreme
values, the median offers a reliable alternative to the mean.
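The difference in robustness between the mean and the median can be checked directly. The sketch below appends one hypothetical data-entry error (a sepal length of 100) to the Iris values and compares both measures before and after.

```r
x <- iris$Sepal.Length
x_outlier <- c(x, 100)   # hypothetical erroneous value, not part of the dataset

before <- c(mean = mean(x), median = median(x))
after <- c(mean = mean(x_outlier), median = median(x_outlier))

print(before)
print(after)   # the mean shifts noticeably, the median does not
```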
# Plotting Boxplot for Sepal Length
boxplot(iris$Sepal.Length, main = "Boxplot of Sepal Length", ylab = "Sepal Length", col = "lightgreen")
abline(h = median(iris$Sepal.Length), col = "blue", lwd = 2)
[Figure: Boxplot of Sepal Length (y-axis: Sepal Length)]
• Mode:
The mode is the value that appears most frequently in the dataset. Unlike the mean and median, the mode is
more useful for categorical or discrete data. For continuous data, like in the Iris dataset, the mode might not
provide much insight.
Relevance: The mode is helpful for identifying the most common value in a dataset. In cases of categorical
data (e.g., species), the mode shows which category occurs most frequently.
# Custom function to calculate mode in R
get_mode <- function(v) {
  uniq_vals <- unique(v)
  uniq_vals[which.max(tabulate(match(v, uniq_vals)))]
}
mode_sepal <- get_mode(iris$Sepal.Length)
[Figure: plot of Count versus Sepal Length]
• Standard Deviation:
The standard deviation measures how spread out the data points are relative to the mean. It is calculated as:
$$\text{Standard Deviation} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where x̄ is the mean of the data. A small standard deviation indicates that the data points are close to the
mean, while a large standard deviation indicates greater variability.
Relevance: The standard deviation provides an understanding of the consistency of the data. In the Iris
dataset, it can help us see whether the sepal lengths vary widely or are concentrated around the mean.
# Plotting Density for Sepal Length with Standard Deviation Lines
plot(density(iris$Sepal.Length), main = "Density Plot of Sepal Length", xlab = "Sepal Length")
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 2)
abline(v = mean(iris$Sepal.Length) + sd(iris$Sepal.Length), col = "blue", lty = 2)
abline(v = mean(iris$Sepal.Length) - sd(iris$Sepal.Length), col = "blue", lty = 2)
• Variance:
Variance is the square of the standard deviation and is calculated as:
$$\text{Variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Figure 4: Density Plot of Sepal Length with Mean and Standard Deviation Lines
Like the standard deviation, variance measures the spread of data points. However, because it is in squared
units, it is less intuitive than standard deviation, though it serves as an important statistical concept in many
models.
Relevance: Variance is useful in statistical modeling (e.g., regression, ANOVA) and other operations requiring
knowledge of data spread. A higher variance means greater variability in the data.
# Calculating variance
variance_sepal <- var(iris$Sepal.Length)
sepal_length <- iris$Sepal.Length  # assumption: defined in the opening lines missing from this listing

# Calculate Mean
mean_sepal <- mean(sepal_length)
print(paste("Mean of Sepal Length:", mean_sepal))

# Calculate Median
median_sepal <- median(sepal_length)
print(paste("Median of Sepal Length:", median_sepal))

# Calculate Mode (custom function: base R's mode() returns the storage type, not the statistical mode)
get_mode <- function(v) {
  uniq_vals <- unique(v)
  uniq_vals[which.max(tabulate(match(v, uniq_vals)))]
}
mode_sepal <- get_mode(sepal_length)
print(paste("Mode of Sepal Length:", mode_sepal))

# Calculate Variance
variance_sepal <- var(sepal_length)
print(paste("Variance of Sepal Length:", variance_sepal))
Listing 13: R Script for Basic Statistical Operations using the Iris Dataset
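As a check on the formulas given earlier, the sample variance and standard deviation can be computed by hand with the n − 1 denominator and compared with R's built-in var() and sd(). This is a small verification sketch; the names var_manual and sd_manual are illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)
x_bar <- mean(x)

# Sample variance: sum of squared deviations divided by (n - 1)
var_manual <- sum((x - x_bar)^2) / (n - 1)
# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

print(all.equal(var_manual, var(x)))
print(all.equal(sd_manual, sd(x)))
```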
desired formats. Common libraries used for importing/exporting datasets include readr, data.table, and rio. For
open-source datasets, websites such as Kaggle, UCI Machine Learning Repository, and government portals provide
a wide range of data.
Below, we will explore these libraries and functions using an open-source dataset (for example, the Iris dataset
in CSV format) and demonstrate a program for importing, cleaning, and exporting data in R.
• data.table: An efficient package for importing, manipulating, and exporting large datasets.
library(data.table)
df <- fread("path/to/data.csv")
• rio: A versatile package that supports importing and exporting data in multiple formats such as CSV, Excel,
JSON, and more. It simplifies the process with a single function for all file types.
library(rio)
df <- import("path/to/data.csv")
export(df, "cleaned_data.csv")
• Kaggle Datasets - Large collection of open-source datasets for data science competitions.
• [Link] - Indian government open data portal with a range of real-world datasets.
4.2 Design an R Program to Import Data from a CSV File, Clean It, and Export the
Cleaned Data
The following R program demonstrates how to import data from a CSV file, perform basic cleaning (such as removing
missing values or NAs), and export the cleaned data into a new CSV file. We’ll also explain each function used in
detail.
# Load necessary libraries
library(readr)
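The listing above survives only in part, so the following is a minimal sketch of the import, clean, and export workflow rather than the original program. It uses base R functions so that it runs without extra packages; readr's read_csv() and write_csv() could be substituted. The demo file and its injected missing values are assumptions made for illustration.

```r
# Create a small demo CSV with missing values (stand-in for a downloaded dataset)
input_path <- tempfile(fileext = ".csv")
demo <- iris[1:10, ]
demo$Sepal.Length[c(2, 5)] <- NA
write.csv(demo, input_path, row.names = FALSE)

# 1. Import the CSV file
raw_data <- read.csv(input_path)

# 2. Clean: count missing values, then drop rows containing any NA
print(colSums(is.na(raw_data)))
cleaned_data <- na.omit(raw_data)

# 3. Export the cleaned data to a new CSV file
output_path <- tempfile(fileext = ".csv")
write.csv(cleaned_data, output_path, row.names = FALSE)
print(output_path)
```

Removing rows is the simplest cleaning strategy. Imputation may be preferable when many rows would otherwise be lost.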
4.3 Steps to Check the Structure, Summarize the Contents, and Verify the Export
of the Cleaned Data
Once the data has been imported, it is essential to understand its structure, clean it if necessary, and verify that it
has been exported correctly. Below are the steps to follow:
1. Check the structure of the imported data: Use the str() function to examine the data’s structure and
understand its composition, including the number of rows, columns, and the data types of each variable.
str(data)
# Example output:
# 'data.frame': 150 obs. of 5 variables:
#  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
#  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
#  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
#  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
#  $ Species     : chr "setosa" "setosa" "setosa" "setosa" ...
2. Summarize the contents of the dataset: Use the summary() function to get a quick statistical summary
of each variable in the dataset (mean, median, min, max, etc.).
summary(data)
# Example output for a numeric column (Sepal.Length):
# Min.   :4.300
# 1st Qu.:5.100
# Median :5.800
# Mean   :5.843
# 3rd Qu.:6.400
# Max.   :7.900
3. Verify the successful export of the cleaned data: After exporting the cleaned data to a new CSV file,
confirm its successful export by re-importing the CSV and checking the structure again.
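The verification step can be sketched as follows, assuming the cleaned data was written with write.csv(). The file path is a temporary file created for the example, and the built-in iris data stands in for the cleaned dataset.

```r
# Export a cleaned dataset to a temporary CSV file
export_path <- tempfile(fileext = ".csv")
cleaned_data <- na.omit(iris)
write.csv(cleaned_data, export_path, row.names = FALSE)

# Re-import and inspect the structure again
reimported <- read.csv(export_path)
str(reimported)

# Simple checks that the export round-tripped correctly
same_shape <- identical(dim(reimported), dim(cleaned_data))
same_names <- identical(names(reimported), names(cleaned_data))
no_missing <- !anyNA(reimported)
print(c(same_shape = same_shape, same_names = same_names, no_missing = no_missing))
```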
– Example: The bar plot displays the count of cars with different numbers of cylinders (cyl). It provides
insight into the most common cylinder configurations.
– Customization: Colors and labels are customized to improve readability.
– Relevance: It helps in understanding the distribution of categorical data.
• Line Plot:
– Example: The line plot shows the trend of mpg across the cars, indexed sequentially. It highlights any
observable pattern in fuel efficiency across the dataset.
– Customization: The line color is changed to make the plot visually appealing.
– Relevance: Useful for detecting trends or seasonal patterns over a continuous variable.
• Scatter Plot:
– Example: The scatter plot illustrates the relationship between hp (horsepower) and mpg. Each point
represents a car, with its horsepower and fuel efficiency plotted.
– Customization: The point color is adjusted, and axis labels are provided to improve understanding.
– Relevance: Scatter plots are excellent for identifying correlations between two continuous variables.
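The three plot types described above can be sketched with base R graphics. The plotting code of the original lab is not shown in this section, so the colors, titles, and the choice of base graphics here are assumptions.

```r
data(mtcars)

# Bar plot: number of cars for each cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Cars by Number of Cylinders",
        xlab = "Number of Cylinders", ylab = "Count", col = "steelblue")

# Line plot: mpg across the row index of the dataset
plot(mtcars$mpg, type = "l", col = "darkgreen",
     main = "MPG across Index", xlab = "Index", ylab = "Miles per Gallon")

# Scatter plot: horsepower versus mpg
plot(mtcars$hp, mtcars$mpg, col = "red", pch = 19,
     main = "MPG vs Horsepower", xlab = "Horsepower", ylab = "Miles per Gallon")
```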
Figure 5: Histogram of Miles per Gallon (MPG) from the mtcars dataset.
Figure 6: Bar Plot showing the distribution of cylinders in the mtcars dataset.
• Line Plot (MPG Index): The line plot shows no clear trend across the index, as mtcars is a small, unordered
dataset of car performance metrics.
Figure 7: Line Plot of MPG across the index of the mtcars dataset.
• Scatter Plot (MPG vs Horsepower): The scatter plot reveals a negative correlation between mpg and hp.
As horsepower increases, fuel efficiency (mpg) tends to decrease, reflecting the trade-off between power and
efficiency in vehicles.
[Figure: Scatter plot of Miles per Gallon versus Horsepower]
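The negative relationship between horsepower and fuel efficiency can be quantified with the Pearson correlation coefficient. This is a small sketch using cor() and cor.test() from base R.

```r
# Pearson correlation between horsepower and miles per gallon
r <- cor(mtcars$hp, mtcars$mpg)
print(r)

# Significance test for the correlation
hp_mpg_test <- cor.test(mtcars$hp, mtcars$mpg)
print(hp_mpg_test$p.value)
```

A coefficient close to -1 indicates a strong negative linear association, which matches the downward pattern in the scatter plot.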
Module - 2
Lab Programs
• Missing Values: Data may have missing values due to errors in data collection or entry. For instance, a
survey dataset might have unanswered questions.
– Example: A customer survey dataset where some respondents did not provide their age. This missing
information needs to be handled to avoid biases.
• Duplicate Entries: Duplicate rows can occur from repeated data entry or merging datasets. These duplicates
can skew analysis results.
– Example: An employee database where an employee is accidentally entered multiple times with different
IDs. This needs to be corrected to maintain data integrity.
• Inconsistent Formatting: Different formats for dates, text, or numerical values can make analysis challenging.
For example, dates might be recorded as “MM/DD/YYYY” in some entries and “DD/MM/YYYY” in
others.
– Example: A sales dataset where some dates are in the format “01-12-2024” and others are “2024/01/12.”
Consistent formatting is essential for accurate time-based analysis.
• Outliers: Extreme values that deviate significantly from other observations can affect statistical measures.
These outliers might be due to data entry errors or actual rare events.
– Example: In a dataset of student exam scores, an entry of 200 in a 100-point exam could be an error
that needs addressing.
• Incorrect Data Types: Data might be recorded in incorrect formats, such as numerical data stored as text,
which can complicate analysis.
– Example: A dataset with a “Price” column containing numeric values stored as text (“$100”, “200”)
instead of pure numbers.
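Several of the issues listed above can be detected with a few base R calls. The sketch below uses a small made-up data frame; the column names and values are hypothetical and chosen to trigger each check.

```r
# Hypothetical survey data containing typical quality problems
survey <- data.frame(
  id = c(1, 2, 2, 3, 4),
  age = c(25, NA, NA, 41, 33),
  price = c("$100", "200", "200", "$350", "80"),
  score = c(88, 92, 92, 200, 75),
  stringsAsFactors = FALSE
)

n_missing <- colSums(is.na(survey))       # missing values per column
dup_rows <- duplicated(survey)            # TRUE for rows repeating an earlier row
survey$price <- as.numeric(gsub("[$,]", "", survey$price))   # text to numeric
out_of_range <- survey$score > 100        # scores above the 100-point maximum
deduped <- survey[!dup_rows, ]            # drop the duplicate rows

print(n_missing)
print(which(dup_rows))
print(survey$price)
print(which(out_of_range))
```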
6.2 Design an R Program to Handle Missing Data, Filter Rows, and Select Specific
Columns
Objective: The objective of this exercise is to design an R program that:
• Handles missing data by imputing or removing it.
• Filters rows based on specific conditions to include only relevant data.
6.2.2 R Code
Below is the R code for data cleaning and preparation. Comments explain each step of the process.
# Load necessary library for data manipulation
library(dplyr)

# Alternatively, you can impute missing values, for example, with the mean of the column
# (restricted to numeric columns, since mean() is not defined for text columns)
# cleaned_data <- data %>%
#   mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
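Most of the listing above is missing from the source, so the following is a hedged sketch of the three objectives rather than the original program. It assumes the built-in airquality dataset as example data (the original dataset is not shown) and requires the dplyr package.

```r
library(dplyr)

data <- airquality   # built-in dataset with missing values in Ozone and Solar.R

cleaned_data <- data %>%
  # Impute missing Ozone values with the column mean
  mutate(Ozone = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone)) %>%
  # Remove rows that are still missing Solar.R
  filter(!is.na(Solar.R)) %>%
  # Keep only hot days
  filter(Temp > 80) %>%
  # Select the columns relevant to the analysis
  select(Month, Day, Ozone, Temp)

print(head(cleaned_data))
```

Imputing one column and removing rows for another shows both strategies side by side. In practice the choice depends on how much data is missing and why.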
7.2.2 R Code
Below is the R code demonstrating the use of dplyr functions. Each operation is explained, and expected outputs
are provided.
# Load the necessary library
library(dplyr)
Expected Output:
                   mpg cyl  hp
Mazda RX4         21.0   6 110
Mazda RX4 Wag     21.0   6 110
Datsun 710        22.8   4  93
Hornet 4 Drive    21.4   6 110
Hornet Sportabout 18.7   8 175
# Create a new column 'hp_per_cyl' that is the horsepower divided by the number of cylinders
mutated_data <- data %>%
  mutate(hp_per_cyl = hp / cyl)
Expected Output:
  mean_mpg total_hp
1    20.09     4694
• Filtering Rows: The filter() function allows for extracting rows that meet certain criteria. This is useful
for narrowing down the data to include only the rows of interest.
• Creating New Columns: The mutate() function is used to add new columns or modify existing ones. It
allows for the transformation of data within the data frame, such as calculating new metrics.
• Summarizing Data: The summarize() function aggregates data to produce summary statistics. This is
useful for gaining insights into the overall characteristics of the dataset.
• Arranging Rows: The arrange() function sorts rows based on column values. This helps in organizing the
data, such as ordering by descending horsepower to easily identify the most powerful cars.
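The four verbs summarized above can be sketched together on the mtcars dataset. This requires the dplyr package; the filter threshold (mpg > 25) is an illustrative choice, not taken from the original listing.

```r
library(dplyr)

data <- mtcars

# filter(): keep rows that meet a condition
high_mpg <- data %>% filter(mpg > 25)

# mutate(): add a derived column
with_ratio <- data %>% mutate(hp_per_cyl = hp / cyl)

# summarize(): collapse the data to summary statistics
overall <- data %>% summarize(mean_mpg = mean(mpg), total_hp = sum(hp))

# arrange(): sort rows, here by descending horsepower
by_power <- data %>% arrange(desc(hp))

print(nrow(high_mpg))
print(head(with_ratio$hp_per_cyl))
print(overall)
print(head(by_power[, c("mpg", "cyl", "hp")]))
```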
8.2.2 R Code
Below is the R code demonstrating various advanced plotting techniques with ggplot2. Each plot is explained, and
expected outputs are provided.
# Load the necessary library
library(ggplot2)
8.2.3 Faceting
Expected Output:
This plot will display scatter plots of horsepower vs. miles per gallon, faceted by the number of cylinders. Each
panel will represent a different number of cylinders, with points colored by cylinder count.
Figure 9: Faceted scatter plot of MPG vs HP, separated by number of cylinders. Each panel represents a different
number of cylinders, with points colored by cylinder count.
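A faceted plot like the one described can be produced along the following lines. The original plotting code is not included in this section, so this is a sketch assuming facet_wrap() on the cyl column; the title and theme are illustrative.

```r
library(ggplot2)

facet_plot <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  facet_wrap(~ cyl) +   # one panel per cylinder count
  labs(title = "MPG vs Horsepower by Number of Cylinders",
       x = "Horsepower",
       y = "Miles Per Gallon",
       color = "Number of Cylinders") +
  theme_minimal()

print(facet_plot)
```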
Expected Output:
This plot will display a scatter plot of weight vs. miles per gallon, with points colored by transmission type and
shaped by the number of cylinders. Custom colors and shapes enhance the visual differentiation.
Figure 10: Scatter plot of MPG vs Weight with customized aesthetics. Points are colored by transmission type and
shaped by the number of cylinders. Custom colors and shapes enhance visual differentiation.
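Custom colors and shapes of this kind are usually set with manual scales. The sketch below assumes scale_color_manual() and scale_shape_manual(); the specific colors and shape codes are illustrative choices, since the original code is not shown.

```r
library(ggplot2)

cars <- mtcars
# In mtcars, am is coded 0 = automatic, 1 = manual
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
cars$cyl <- factor(cars$cyl)

custom_plot <- ggplot(cars, aes(x = wt, y = mpg, color = am, shape = cyl)) +
  geom_point(size = 3) +
  scale_color_manual(values = c(Automatic = "steelblue", Manual = "darkorange")) +
  scale_shape_manual(values = c("4" = 16, "6" = 17, "8" = 15)) +
  labs(x = "Weight", y = "Miles Per Gallon",
       color = "Transmission", shape = "Number of Cylinders") +
  theme_minimal()

print(custom_plot)
```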
Expected Output:
This plot will show a scatter plot of horsepower vs. miles per gallon, with annotations displaying the row names
(car models) next to each point. This helps in identifying individual data points.
Figure 11: Scatter plot of HP vs MPG with annotations. Each data point is labeled with the car model name, aiding
in identifying individual observations.
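Point labels such as these can be added with geom_text(). This is a minimal sketch in which the row names of mtcars are copied into a hypothetical model column; check_overlap = TRUE drops labels that would collide, so not every car is labeled.

```r
library(ggplot2)

cars <- mtcars
cars$model <- rownames(mtcars)   # car model names as a regular column

label_plot <- ggplot(cars, aes(x = mpg, y = hp, color = factor(cyl))) +
  geom_point() +
  geom_text(aes(label = model), size = 3, vjust = -0.7,
            check_overlap = TRUE, show.legend = FALSE) +
  labs(x = "Miles Per Gallon", y = "Horsepower", color = "Number of Cylinders") +
  theme_minimal()

print(label_plot)
```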
• Mean: The arithmetic average of all data points. It provides a measure of central tendency but can be affected
by outliers.
• Median: The middle value in a sorted dataset. It is less sensitive to outliers and skewed distributions compared
to the mean.
• Range: The difference between the maximum and minimum values. It indicates the spread of the data but
can be sensitive to extreme values.
• Quartiles: Values that divide the data into four equal parts. They provide insights into the spread and central
tendency, including the first quartile (Q1), median (Q2), and third quartile (Q3).
• Standard Deviation: Measures the typical distance of the data points from the mean (formally, the square
root of the variance). It indicates how spread out the data points are.
• Variance: The square of the standard deviation. It also measures the spread of data but is expressed in
squared units.
9.2 Design an R Program to Generate Descriptive Statistics and Create a Data Summary Report
We will use the mtcars dataset for this example. The mtcars dataset includes various attributes of different car
models, such as miles per gallon (mpg), number of cylinders, horsepower, etc. We will calculate descriptive statistics
and generate a summary report.
# Load necessary library
library(dplyr)

data <- mtcars  # assumption: the dataset is assigned in the lines missing from this listing

# Calculate mean, median, range, quartiles, standard deviation, and variance for MPG
mean_mpg <- mean(data$mpg)
median_mpg <- median(data$mpg)
range_mpg <- range(data$mpg)
quartiles_mpg <- quantile(data$mpg)
sd_mpg <- sd(data$mpg)
variance_mpg <- var(data$mpg)
Statistic             Value
Mean of MPG           20.09
Median of MPG         19.20
Range of MPG          10.40 - 33.90
First Quartile (Q1)   15.43
Third Quartile (Q3)   22.80
Standard Deviation    6.03
Variance              36.32
• Mean of MPG: The average miles per gallon for cars in the dataset is 20.09. This value represents the central
tendency of the data.
• Median of MPG: The median value of 19.20 is less influenced by extreme values compared to the mean. It
indicates the middle point of the dataset.
• Range of MPG: The range from 10.40 to 33.90 shows the spread of the MPG values across the dataset.
• Quartiles: The first quartile (Q1) at 15.43 and the third quartile (Q3) at 22.80 provide insights into the spread
of the middle 50% of the data.
• Standard Deviation: A standard deviation of 6.03 indicates the typical deviation of the MPG values from
the mean. A higher standard deviation means greater variability.
• Variance: The variance of 36.32, being the square of the standard deviation, also measures data dispersion
but in squared units.
The summary table consolidates these measures, offering a comprehensive view of the data’s central tendency
and variability. The table is saved as a CSV file for further analysis or reporting.
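The report-building step is not shown in the listing above, so the following is a sketch of how the statistics might be collected into a table and saved. The file name mpg_summary.csv and the use of a temporary directory are assumptions made for the example.

```r
data <- mtcars
q <- quantile(data$mpg, probs = c(0.25, 0.75))   # first and third quartiles

summary_table <- data.frame(
  Statistic = c("Mean", "Median", "Minimum", "Maximum",
                "First Quartile (Q1)", "Third Quartile (Q3)",
                "Standard Deviation", "Variance"),
  Value = c(mean(data$mpg), median(data$mpg), min(data$mpg), max(data$mpg),
            q[[1]], q[[2]], sd(data$mpg), var(data$mpg))
)
print(summary_table)

# Save the report as a CSV file
report_path <- file.path(tempdir(), "mpg_summary.csv")
write.csv(summary_table, report_path, row.names = FALSE)
```

The values are stored unrounded. Rounding for display (for example with round() or format()) can be applied when the report is printed.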
$$Y = \beta_0 + \beta_1 X + \epsilon \tag{1}$$
where:
• $\epsilon$ represents the error term (the difference between the observed and predicted values of $Y$).
The goal of simple linear regression is to determine the values of $\beta_0$ and $\beta_1$ that minimize the sum of the squared
errors (Residual Sum of Squares, RSS). This is achieved using methods like Ordinary Least Squares (OLS).
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \tag{2}$$
where $X_1, X_2, \ldots, X_p$ are the independent variables, and $\beta_1, \beta_2, \ldots, \beta_p$ are the corresponding coefficients.
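For the simple model in Equation (1), the OLS estimates have a closed form: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept follows from the two means. The sketch below computes both by hand on mtcars (mpg as Y and wt as X, an illustrative choice) and compares them with lm().

```r
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form OLS estimates for simple linear regression
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

# Same model fitted with lm()
fit <- lm(mpg ~ wt, data = mtcars)
print(c(beta0 = beta0, beta1 = beta1))
print(coef(fit))

# Residual Sum of Squares minimized by OLS
rss <- sum(resid(fit)^2)
print(rss)
```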
10.3.1 1. Economics
In economics, linear regression can model the relationship between a country’s GDP and factors such as education
level, employment rate, and investment. For instance, a researcher might use linear regression to predict the GDP
based on the level of investment in education.
10.3.2 2. Healthcare
Healthcare professionals use linear regression to analyze the impact of various factors on health outcomes. For
example, linear regression can be used to predict blood pressure based on age, weight, and cholesterol levels. This
helps in identifying risk factors and improving patient care.
10.3.3 3. Marketing
In marketing, linear regression can help predict sales based on advertising expenditure, price changes, and other
marketing strategies. For instance, a company might use linear regression to forecast future sales based on past
advertising spend and product pricing.
10.3.5 5. Education
In education, linear regression can analyze the impact of various factors on student performance. For example, it can
be used to predict student grades based on study hours, attendance, and participation in extracurricular activities.
# Load dataset
data(mtcars)

# Plot 3: Q-Q Plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()
Figure 12: Scatter plot of MPG versus Weight with the regression line.
Explanation:
• The scatter plot shows the relationship between the independent variable (Weight of the car) on the x-axis and
the dependent variable (Miles per Gallon) on the y-axis.
• The red line is the fitted linear regression line showing the predicted relationship between wt and mpg.
Analysis:
• Advantages: The plot allows easy visualization of the relationship between variables. The slope of the
regression line helps understand the trend, i.e., as weight increases, MPG decreases.
• Disadvantages: If the relationship between the variables is nonlinear, this plot may not capture it accurately.
[Figure: Residuals vs Fitted plot (x-axis: Fitted Values; y-axis: Residuals)]
Explanation:
• This plot shows the residuals (difference between observed and predicted values) on the y-axis and the fitted
values (predicted by the model) on the x-axis.
• The horizontal dashed line at zero represents where residuals should ideally fall if the model fits the data well.
Analysis:
• Advantages: The residuals vs fitted values plot is useful for checking the assumption of linearity and
homoscedasticity. A random scatter of points around the zero line indicates that the model is appropriate.
• Disadvantages: A non-random pattern (e.g., curves) could indicate that the model does not fit well or there
is non-linearity that a linear model cannot capture.
# Plot 3: Q-Q Plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
       x = "Theoretical Quantiles",
       y = "Standardized Residuals") +
  theme_minimal()
[Figure: Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Standardized Residuals)]
Explanation:
• The Q-Q plot compares the standardized residuals to a theoretical normal distribution. If the residuals follow
a normal distribution, they will lie along the straight Q-Q line.
Analysis:
• Advantages: This plot checks the normality assumption of residuals. If the points follow the line, it suggests
that the residuals are normally distributed, which is a key assumption of linear regression.
• Disadvantages: Significant deviations from the line suggest that the residuals are not normally distributed,
which could invalidate the regression results.
[Figure: Scale-Location Plot (x-axis: Fitted Values; y-axis: Square Root of Standardized Residuals)]
Explanation:
• The Scale-Location plot (also known as the Spread-Location plot) helps check the assumption of
homoscedasticity, or equal variance of residuals.
• It plots the square root of the absolute standardized residuals versus the fitted values.
Analysis:
• Advantages: If the residuals show a random scatter around the line, it suggests that the variance of errors is
consistent across all fitted values.
• Disadvantages: A funnel shape indicates heteroscedasticity, meaning that the variance of the errors is not
constant, which could affect the reliability of the regression model.
[Figure: Standardized Residuals versus Leverage, with points colored by Cook's distance (.cooksd)]
Explanation:
• This plot shows the leverage (a measure of how far an observation's predictor values lie from the mean of the
predictors) versus standardized residuals. Points with standardized residuals beyond the dashed lines at -2 and 2 may be influential points.
• It also colors the points by Cook’s distance, a measure that shows the influence of each point on the regression
model.
Analysis:
• Advantages: This plot identifies outliers or high-leverage points, which could disproportionately affect the
regression model.
• Disadvantages: If the plot identifies many influential points, the model may be unstable or require further
investigation to handle outliers.
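The quantities behind these diagnostic plots can also be extracted numerically with base R. The sketch below assumes the model is lm(mpg ~ wt), consistent with Figure 12. The cut-offs used (absolute standardized residual above 2, leverage above twice the average) are common rules of thumb rather than values taken from the lecture.

```r
model <- lm(mpg ~ wt, data = mtcars)

diag_df <- data.frame(
  fitted = fitted(model),            # predicted values
  resid = resid(model),              # raw residuals
  std_resid = rstandard(model),      # standardized residuals
  leverage = hatvalues(model),       # leverage (diagonal of the hat matrix)
  cooksd = cooks.distance(model)     # Cook's distance
)

# Flag potentially problematic observations
large_resid <- rownames(diag_df)[abs(diag_df$std_resid) > 2]
high_leverage <- rownames(diag_df)[diag_df$leverage > 2 * mean(diag_df$leverage)]
print(large_resid)
print(high_leverage)

# Base R versions of the four diagnostic plots
op <- par(mfrow = c(2, 2))
plot(model)
par(op)
```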
10.6 Conclusion
Linear regression is a versatile and widely used statistical method that helps in understanding and predicting
relationships between variables. By fitting a linear model to data, one can make informed decisions and predictions
based on empirical evidence. It is essential in various domains, including economics, healthcare, marketing, real
estate, and education.
Module - 3
Lab Programs
# Load dataset
data(Boston)

# Plot Q-Q plot
ggplot(data = model, aes(sample = .stdresid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot of Residuals",
[Figure: Residuals vs Fitted plot (x-axis: Fitted Values; y-axis: Residuals)]
[Figure: Q-Q plot (x-axis: Theoretical Quantiles; y-axis: Standardized Residuals)]
[Figure: Scale-Location Plot (x-axis: Fitted Values; y-axis: Square Root of Standardized Residuals)]
[Figure: Standardized Residuals plot with points colored by Cook's distance (.cooksd)]
• Fitting the Model: The lm() function fits a multiple linear regression model predicting medv from rm,
crim, and tax.
• Model Summary: The summary() function provides coefficients, R-squared values, and significance levels.
• Residuals vs Fitted Plot: This plot helps identify heteroscedasticity (non-constant variance of residuals).
Ideally, residuals should be randomly scattered around zero.
• Q-Q Plot: This plot checks if residuals are normally distributed. Points should lie along the theoretical
quantiles line.
• Scale-Location Plot: This plot helps assess homoscedasticity. A horizontal line indicates equal variance.
• Leverage vs Standardized Residuals Plot: This plot helps detect influential data points. Points with high
leverage or large residuals might unduly influence the model.
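The quantities behind the four diagnostic plots described above can also be computed directly. The following is a minimal sketch in base R, assuming the Boston data comes from the MASS package; the data frame name `diagnostics` and its column names are illustrative and do not appear in the original listing.

```r
# Boston housing data ships with the MASS package (assumed source of the data)
library(MASS)

# Fit the multiple linear regression described above
model <- lm(medv ~ rm + crim + tax, data = Boston)

# Quantities shown in the diagnostic plots (column names are illustrative)
diagnostics <- data.frame(
  fitted    = fitted(model),         # x-axis: Residuals vs Fitted, Scale-Location
  resid     = residuals(model),      # y-axis: Residuals vs Fitted
  std_resid = rstandard(model),      # standardized residuals (Q-Q, leverage plot)
  leverage  = hatvalues(model),      # leverage of each observation
  cooks_d   = cooks.distance(model)  # influence of each observation
)

# y-axis of the Scale-Location plot
diagnostics$sqrt_abs_std_resid <- sqrt(abs(diagnostics$std_resid))

# Coefficients, R-squared, and significance levels
print(summary(model))

# The five observations with the largest Cook's distance
print(head(diagnostics[order(-diagnostics$cooks_d), ], 5))
```

Sorting by Cook's distance is one simple way to list the points that the leverage plot would highlight.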
• Coefficient Interpretation: High multicollinearity can make the coefficients of the predictors unstable and
sensitive to small changes in the data. This instability can lead to large standard errors for the coefficients,
which means that the confidence intervals for the predictors are wider and less informative.
• Model Performance: Multicollinearity can affect the overall performance of the regression model. While the
model might still have a high R-squared value, the inflated standard errors and unstable coefficients can reduce
the model’s predictive power and its reliability for making inferences.
• Feature Selection: In the presence of multicollinearity, it becomes challenging to determine which predictor
variables are truly significant. This can impact feature selection processes and lead to incorrect conclusions
about which variables are important for predicting the response variable.
How Multicollinearity is Detected and Addressed:
• Variance Inflation Factor (VIF): The VIF quantifies how much the variance of a regression coefficient is
inflated due to multicollinearity. It is calculated for each predictor variable and is defined as:
\[ \mathrm{VIF}_i = \frac{1}{1 - R_i^2} \]
where $R_i^2$ is the R-squared value obtained by regressing the i-th predictor on all other predictors. A VIF value
greater than 10 is often used as a threshold to indicate significant multicollinearity.
• Correlation Matrix: Examining the correlation matrix of the predictor variables can provide insights into
which variables are highly correlated. High pairwise correlations (close to ±1) may indicate potential multi-
collinearity issues.
• Condition Index: The condition index measures the sensitivity of the regression coefficients to small changes
in the predictor variables. A high condition index (typically above 30) suggests multicollinearity.
• Regularization Techniques: Techniques such as Ridge Regression or LASSO (Least Absolute Shrinkage and
Selection Operator) can be used to address multicollinearity by adding a penalty to the size of the coefficients,
thereby stabilizing the model.
In summary, addressing multicollinearity is a critical aspect of building reliable and interpretable regression models
in advanced data analysis. By identifying and mitigating multicollinearity, we can improve the stability and validity
of the regression coefficients and enhance the overall performance of the model.
# Load necessary library
library(car)

# ...
# Calculate VIF
vif(model)
Listing 31: Checking for Multicollinearity
Expected Output: The output from the ‘vif‘ function will provide VIF values for each predictor variable. For
example:
rm crim tax
1.500 1.200 1.800
These VIF values are below 10, indicating no significant multicollinearity issues.
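To make the VIF definition concrete, the sketch below computes it by hand in base R: each predictor is regressed on the others and the resulting R-squared is plugged into the formula. It assumes the Boston data from the MASS package and the predictors ‘rm‘, ‘crim‘, and ‘tax‘; the names `predictors` and `manual_vif` are illustrative. The values it prints come from the real data and need not match the example figures above.

```r
library(MASS)  # provides the Boston data set (assumed source)

predictors <- c("rm", "crim", "tax")

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
# predictor i on all the other predictors
manual_vif <- sapply(predictors, function(p) {
  others  <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = Boston)
  1 / (1 - summary(aux_fit)$r.squared)
})
print(round(manual_vif, 3))

# Pairwise correlations among the predictors, as a complementary check
print(round(cor(Boston[, predictors]), 3))
```

For a model with only numeric predictors, these hand-computed values should agree with what ‘vif‘ from the ‘car‘ package reports.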
Call:
lm(formula = medv ~ rm + tax, data = Boston)
Coefficients:
(Intercept) rm tax
3.234e+01 -2.015e+00 -6.321e-03
Explanation of Output:
• Call: Shows the final model formula selected by stepwise selection. In this example, the model includes ‘rm‘
(average number of rooms) and ‘tax‘ (property tax rate) as predictors.
• Coefficients: Provides the estimated coefficients for the intercept, ‘rm‘, and ‘tax‘. For instance, an intercept
of 32.34, a coefficient of -2.015 for ‘rm‘, and -0.00632 for ‘tax‘ indicate how these variables influence the median
value of homes (‘medv‘).
• Residual Standard Error: Measures the average distance between the observed values and the predicted
values. A lower value suggests a better fit.
• Multiple R-squared: Indicates the proportion of the variance in the response variable that is predictable
from the predictors. A value of 0.625 means 62.5% of the variance in ‘medv‘ is explained by the model.
• Adjusted R-squared: Adjusts the R-squared value for the number of predictors in the model. It provides a
more accurate measure of model performance, especially when multiple predictors are used.
Significance: The results of stepwise selection provide a refined model with predictors that are deemed most
significant according to the AIC criterion. This model is often preferred for its balance between fit and complexity,
helping to avoid overfitting while maintaining interpretability.
Stepwise selection, however, should be used cautiously as it relies on statistical criteria and may not always produce
the best model for all applications. It is beneficial to complement this method with other techniques and domain
knowledge.
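A minimal sketch of AIC-based stepwise selection with the base R function ‘step‘ is shown below. It assumes the full model uses ‘rm‘, ‘crim‘, and ‘tax‘ on the Boston data from the MASS package; the name `full_model` is illustrative. The predictors retained on the real data may differ from the example output shown above.

```r
library(MASS)  # provides the Boston data set (assumed source)

# Start from the full model and let step() add or drop terms by AIC
full_model <- lm(medv ~ rm + crim + tax, data = Boston)
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Formula and coefficients of the selected model
print(formula(stepwise_model))
print(coef(stepwise_model))

# Compare the information criterion before and after selection
print(c(full = AIC(full_model), stepwise = AIC(stepwise_model)))
```

Setting `trace = 0` suppresses the step-by-step log; remove it to see which terms were considered at each stage.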
11.4 Cross-Validation
Cross-validation is a robust technique used to assess the performance and generalizability of a predictive model. It
helps ensure that the model is not overfitting to the training data and provides an estimate of how well the model
will perform on unseen data. One common method of cross-validation is k-fold cross-validation.
K-Fold Cross-Validation: In k-fold cross-validation, the dataset is divided into k equally sized subsets, or "folds."
The model is trained on k−1 of these folds and validated on the remaining fold. This process is repeated k times, with
each fold serving as the validation set exactly once. The overall performance is then averaged over all k iterations to
provide a more reliable estimate of the model’s generalizability.
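The procedure just described can be written out by hand. The sketch below is a base R illustration, assuming a linear model of ‘medv‘ on ‘rm‘ and ‘tax‘ fitted to the Boston data from the MASS package; the variable names (`fold_id`, `fold_mse`, `cv_mse`) and the seed are illustrative choices.

```r
library(MASS)  # provides the Boston data set (assumed source)

set.seed(42)   # fold assignment is random; the seed is arbitrary
k <- 10
n <- nrow(Boston)

# Assign every row to one of k folds of (nearly) equal size
fold_id <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (i in 1:k) {
  train <- Boston[fold_id != i, ]   # k - 1 folds for training
  valid <- Boston[fold_id == i, ]   # held-out fold for validation
  fit   <- lm(medv ~ rm + tax, data = train)
  pred  <- predict(fit, newdata = valid)
  fold_mse[i] <- mean((valid$medv - pred)^2)
}

# Average the k validation errors into one estimate
cv_mse <- mean(fold_mse)
print(round(fold_mse, 2))
print(cv_mse)
```

Each row is used for validation exactly once, so the averaged error reflects performance on data the model did not see during fitting.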
Implementation in R: In R, the ‘[Link]‘ function from the ‘boot‘ package is used to perform k-fold cross-validation
for generalized linear models. Below is the R code for performing cross-validation:
# Load necessary library
library(boot)

• Loading the Library: The ‘boot‘ library is required to use the ‘[Link]‘ function.
• Defining Cross-Validation Function: The ‘[Link]‘ function performs cross-validation on the specified
model (‘stepwise_model‘). The ‘K=10‘ argument specifies 10-fold cross-validation, meaning the dataset is split
into 10 subsets.
• Output: The result of ‘[Link]‘ is stored in ‘cv‘, and ‘[Link]‘ provides the cross-validation errors.
Expected Output: The output from the ‘[Link]‘ function will include cross-validation errors. For example:
• Cross-Validated Estimate of Model’s Error: The first value, ‘23.47‘, represents the average error of the
model across all folds. This value indicates how well the model performs on average when validated on different
subsets of the data.
• Adjusted Cross-Validation Estimate: The second value, ‘25.32‘, is the bias-adjusted cross-validation
estimate, assuming the output is the ‘delta‘ component of a ‘cv.glm‘ result from the ‘boot‘ package. The
adjustment compensates for the bias introduced by not using leave-one-out cross-validation; it is not a
standard error.
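A minimal sketch of the package-based approach follows. It assumes that the function referred to above is ‘cv.glm‘ from the ‘boot‘ package, and it fits a Gaussian ‘glm‘ of ‘medv‘ on ‘rm‘ and ‘tax‘ as a stand-in for the stepwise model; the name `glm_fit` and the seed are illustrative. The numbers it prints come from the real Boston data and need not match the example values above.

```r
library(MASS)  # Boston data set (assumed source)
library(boot)  # cv.glm()

# cv.glm() expects a glm object; the default gaussian family
# is equivalent to ordinary linear regression
glm_fit <- glm(medv ~ rm + tax, data = Boston)

set.seed(42)   # the fold split is random; the seed is arbitrary
cv <- cv.glm(Boston, glm_fit, K = 10)

# delta[1]: raw cross-validation estimate of prediction error (MSE)
# delta[2]: adjusted estimate, as described in the boot documentation
print(cv$delta)
```

Because the folds are drawn at random, repeated runs without a fixed seed give slightly different estimates.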
11.4.1 Analysis
• Model Summary: The summary provides the regression coefficients for ‘rm‘, ‘crim‘, and ‘tax‘, along with
R-squared and adjusted R-squared values.
• Residuals vs Fitted Plot: Should display residuals scattered randomly around zero, indicating that the
model’s assumptions are satisfied.
• Q-Q Plot: Should show residuals approximately following a 45-degree line if they are normally distributed.
• Scale-Location Plot: Should display a horizontal trend, suggesting that residuals are homoscedastic.
• Leverage vs Standardized Residuals Plot: Points should fall within acceptable bounds, indicating no
extreme leverage or influential outliers.
• VIF Results: Provide insight into multicollinearity. VIF values above 10 suggest high multicollinearity.
• Stepwise Selection: Reveals the most significant predictors in the model.
• Cross-Validation: Provides an estimate of how well the model performs on unseen data, with lower cross-
validation error indicating better model performance.
– Linear regression
– Decision trees
– Neural networks
• Unsupervised Learning: Works with data that has no labeled outcomes. The goal is to uncover hidden
patterns or structures in the data. Common algorithms include:
– Clustering techniques (e.g., K-means, hierarchical clustering)
– Dimensionality reduction methods (e.g., Principal Component Analysis, PCA)
• Reinforcement Learning: Focuses on training an agent to make decisions in a dynamic environment by:
R is one of the most widely used programming languages for machine learning due to its rich ecosystem of libraries,
ease of use, and strong community support. In this section, we will explore unsupervised learning by implementing
the K-means clustering algorithm. Clustering is a technique that groups data points into clusters based on similarity.
K-means is a widely used clustering algorithm that partitions a dataset into a predefined number of clusters, k.
The algorithm minimizes the sum of the squared distances between each data point and the centroid of its assigned
cluster.
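As a minimal sketch of this idea, the base R function ‘kmeans‘ can be applied to the built-in iris data. The choice of the two sepal measurements as features and k = 3 follows the figure discussed in this section; the names `features`, `km`, and `manual_wss`, and the seed, are illustrative.

```r
# Two numeric features from the built-in iris data set
features <- iris[, c("Sepal.Length", "Sepal.Width")]

set.seed(123)  # initial centroids are random; the seed is arbitrary
km <- kmeans(features, centers = 3, nstart = 20)

# The objective K-means minimizes: sum of squared distances between
# each point and the centroid of its assigned cluster
manual_wss <- sum((as.matrix(features) - km$centers[km$cluster, ])^2)
print(all.equal(manual_wss, km$tot.withinss))

# Compare cluster assignments with the known species
print(table(Cluster = km$cluster, Species = iris$Species))

# Scatter plot colored by assigned cluster, centroids marked with crosses
plot(features, col = km$cluster, pch = 19,
     xlab = "Sepal Length", ylab = "Sepal Width")
points(km$centers, pch = 4, cex = 2, lwd = 2)
```

Using `nstart = 20` reruns the algorithm from several random starts and keeps the best solution, which reduces the chance of a poor local minimum.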
Expected Output: The expected output from the code will show a scatter plot of the iris data points, colored
by their assigned cluster.
[Scatter plot: Sepal Width against Sepal Length, points colored by Cluster (1, 2, 3)]
Figure 21: K-means Clustering of Iris Data Based on Sepal Length and Sepal Width
The K-means clustering algorithm partitions the iris dataset into three clusters. Each data point is assigned to a
cluster based on the proximity to the cluster centroid. The visualization shows how well the algorithm separates the
different species based on Sepal Length and Sepal Width. In this case, the clusters roughly correspond to the three
species of iris in the dataset: setosa, versicolor, and virginica.
Expected Output: The Elbow Method plot will show the relationship between the number of clusters and the
WSS.
[Figure 22: Elbow plot; x-axis: Number of Clusters]
Analysis: From the Elbow plot (Figure 22), we observe that the WSS sharply decreases until k = 3, after which
the decrease slows. This suggests that k = 3 is the optimal number of clusters for the iris dataset, aligning with the
known species.
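The Elbow Method can be sketched in base R by running ‘kmeans‘ for a range of k and recording the total within-cluster sum of squares (WSS). The sketch assumes the same two sepal features as the clustering example; the range 1 to 10 and the names `k_values` and `wss` are illustrative.

```r
features <- iris[, c("Sepal.Length", "Sepal.Width")]

set.seed(123)  # arbitrary seed for reproducible starts
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where the curve stops dropping steeply
plot(k_values, wss, type = "b",
     xlab = "Number of Clusters", ylab = "WSS",
     main = "Elbow Method")
```

WSS always falls as k grows, so the aim is to find the point of diminishing returns rather than the minimum.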
Practical Applications of Clustering: Clustering techniques like K-means are widely used in various fields:
• Customer Segmentation: Businesses use clustering to segment customers based on purchasing behavior,
enabling targeted marketing strategies.
• Image Compression: Clustering can reduce the number of colors in an image, thereby compressing it without
significant loss of quality.
• Document Classification: Clustering is used to organize large datasets of documents, such as categorizing
news articles or research papers based on topics.
• Anomaly Detection: In cybersecurity, clustering is employed to detect unusual patterns in network traffic
that could signal security breaches.
In data analysis, clustering helps identify inherent structures within the data, enabling more informed decision-
making. By grouping similar data points, clustering can simplify complex datasets and provide valuable insights
that guide business strategies, scientific research, and technological advancements.
R provides a robust environment for time series analysis, offering functions for importing data, performing exploratory
analysis, and fitting complex models. In TSA, the key is understanding the components of the series—namely trend,
The most widely used models for forecasting time series data are ARIMA models (AutoRegressive Integrated Moving
Average), which combine three different elements:
13.1 Design an R Program to Analyze and Forecast Time Series Data using ARIMA
Models
To illustrate TSA with R, we will use an open-source dataset, ‘AirPassengers‘, which contains monthly totals of
international airline passengers from 1949 to 1960.
# Load necessary libraries
library(forecast)
library(tseries)

[Figure: AirPassengers Data (y-axis: Passengers; x-axis: Year), followed by a second series plotted against Time.]
Series: log(ts_data)
ARIMA(0,1,1)(0,1,1)[12]
Coefficients:
ma1 sma1
-0.4018 -0.5569
s.e. 0.0896 0.0731
• The (0,1,1)[12] part refers to the seasonal components of the ARIMA model, where there is no seasonal AR
term (0), one order of seasonal differencing (1), and one seasonal MA term (1). The [12] indicates a periodicity
of 12, suggesting that the data exhibits yearly seasonality, which is common for monthly data.
Coefficients:
• ma1 and sma1 represent the moving average (MA) and seasonal moving average (SMA) coefficients, respectively.
• ma1 = -0.4018: This coefficient indicates the short-term relationship between past error terms and the current
observation.
• sma1 = -0.5569: This represents the seasonal component, where past errors from a year ago are factored into
the forecast.
• s.e.: These are the standard errors of the coefficients, helping to understand the uncertainty of the estimates.
Both coefficients have relatively small standard errors, suggesting the estimates are reasonably precise.
• MAE (Mean Absolute Error): 0.02626034 - This gives the average absolute difference between the pre-
dicted and actual values. Lower values suggest better predictions.
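• MPE (Mean Percentage Error): 0.01098898 - This is the mean of the percentage errors in the model’s
predictions, showing the average bias. It is expressed in percent, so a value this close to zero indicates almost
no systematic over- or under-prediction.
• MAPE (Mean Absolute Percentage Error): 0.4752815 - This is the average absolute error as a percentage
of the observed values, which makes it easier to interpret than RMSE or MAE. Here the fitted values differ
from the observed values by about 0.48% on average. Because the model was fit to log(ts_data), these
percentages refer to the log scale, not to passenger counts.
• MASE (Mean Absolute Scaled Error): 0.2169522 - This is another scaled error metric; a value below 1
indicates better performance compared to a naive model.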
• ACF1 (Autocorrelation of residuals at lag 1): 0.01443892 - This measures the correlation of residuals
with their lagged values. A low value suggests that the residuals are uncorrelated, which is a good sign.
This summary highlights that the ARIMA model effectively fits the data, with reasonably low error metrics and
coefficients that explain the time series dynamics well. The model captures both the non-seasonal and seasonal
elements of the series, making it a strong candidate for forecasting.
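The model and several of the error measures above can be reproduced approximately with base R alone. The sketch below is an illustration under stated assumptions: it fits the ARIMA(0,1,1)(0,1,1)[12] specification directly with ‘arima‘ from the stats package rather than through the forecast package, and computes the metrics from the in-sample residuals on the log scale. The names `log_series`, `res`, `mae`, `rmse`, `mape`, and `acf1` are illustrative.

```r
ts_data <- AirPassengers       # built-in monthly series, 1949-1960
log_series <- log(ts_data)

# Seasonal ARIMA with the orders reported in the summary above
fit <- arima(log_series,
             order    = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
print(coef(fit))               # ma1 and sma1

# In-sample residuals on the log scale
res <- residuals(fit)

mae  <- mean(abs(res))                     # Mean Absolute Error
rmse <- sqrt(mean(res^2))                  # Root Mean Square Error
mape <- mean(abs(res / log_series)) * 100  # MAPE, in percent
acf1 <- acf(res, plot = FALSE)$acf[2]      # lag-1 autocorrelation

print(c(MAE = mae, RMSE = rmse, MAPE = mape, ACF1 = acf1))
```

Small differences from the figures reported by the forecast package are possible, since the two implementations can treat the first few residuals differently.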
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1961 455.9929 446.8214 465.1645 441.8634 470.1223
Feb 1961 421.0283 411.8568 430.1999 406.8988 435.0579
Mar 1961 462.3956 453.2241 471.5672 448.2661 476.5252
...
Dec 1961 475.0584 465.8869 484.2300 460.9289 489.1880
This table provides the forecasted values (Point Forecast) for each month, along with the lower and upper
bounds of the 80% and 95% prediction intervals.
Plot Interpretation: The generated plot (refer to Figure ??) will display the following elements:
• Historical Data: The actual data points of the original time series are plotted on the left side of the graph.
• Forecasted Values: The forecasted points for the next 12 months are plotted on the right side.
• Prediction Intervals: The shaded areas around the forecast represent the 80% and 95% prediction intervals.
The wider the intervals, the more uncertainty is associated with the forecast.
The plot provides valuable insights into the expected future behavior of the time series, and the prediction intervals
indicate the range within which the actual future values are likely to fall. In this case, we are forecasting the
AirPassengers dataset for the next 12 months.
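A 12-month forecast with 80% and 95% intervals can be sketched in base R as follows. This is an illustration that assumes the ARIMA(0,1,1)(0,1,1)[12] model on the log scale and normally distributed forecast errors; it uses ‘predict‘ from the stats package instead of the forecast package, and the names `fc`, `z80`, `z95`, and `forecast_table` are illustrative. The values it produces need not match the table above.

```r
fit <- arima(log(AirPassengers),
             order    = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Point forecasts and standard errors for the next 12 months (log scale)
fc <- predict(fit, n.ahead = 12)

# Normal quantiles for two-sided 80% and 95% intervals
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)

# Back-transform from the log scale to passenger counts with exp()
forecast_table <- data.frame(
  month = month.abb[cycle(fc$pred)],
  point = as.numeric(exp(fc$pred)),
  lo80  = as.numeric(exp(fc$pred - z80 * fc$se)),
  hi80  = as.numeric(exp(fc$pred + z80 * fc$se)),
  lo95  = as.numeric(exp(fc$pred - z95 * fc$se)),
  hi95  = as.numeric(exp(fc$pred + z95 * fc$se))
)
print(forecast_table, digits = 5)
```

The intervals widen with the forecast horizon because the standard errors in `fc$se` grow as predictions move further from the last observation.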
13.4 Use a Time Series Dataset, Perform Exploratory Data Analysis, Fit an ARIMA
Model, and Make Future Forecasts
We will continue using the ‘AirPassengers‘ dataset. First, we perform exploratory data analysis (EDA) to understand
the structure of the data. EDA for time series typically involves visualizing the data, checking for trends, seasonality,
and stationarity.
# Plot original data
plot(ts_data, main = "Original AirPassengers Data", ylab = "Passengers")

[Figure: AirPassengers Data (y-axis: Passengers; x-axis: Year; a second panel is plotted against Time).]
The test is applied to the log-transformed series (log(ts_data)) with the alternative hypothesis being that the
time series is stationary. The adf.test function from the tseries package is used.
data: log(ts_data)
Dickey-Fuller = -6.4215, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary
• Lag Order: The test uses a lag order of 5. This means that 5 lagged differences of the time series were
considered in the test to remove autocorrelation in the residuals.
• p-value: The reported p-value is 0.01. The ‘adf.test‘ function does not print p-values below 0.01, so the true
p-value may be smaller. Since this is below the commonly used 0.05 significance level, we reject the null
hypothesis of a unit root.
• Alternative Hypothesis: The alternative hypothesis is that the series is stationary. Because the test
regression in ‘adf.test‘ includes a constant and a linear time trend, rejecting the null supports stationarity of
the log-transformed series around a trend; the remaining trend and seasonality are handled by the
differencing terms of the ARIMA model.
[Figure: log-transformed series (y-axis: Log of Values).]
The ADF test results (Figure 28) show that the series can be considered stationary, thus fulfilling an important
precondition for ARIMA modeling.
Expected Output:
• A plot of the original data showing the overall trend and seasonal pattern.
Conclusion: Based on the results of the Augmented Dickey-Fuller test, with a test statistic of -6.4215
and a p-value of 0.01, we reject the null hypothesis of non-stationarity. Therefore, the log-transformed time series is
stationary, and we can proceed with fitting time series models such as ARIMA for forecasting. Ensuring stationarity
is crucial for achieving reliable forecasts, as non-stationary data can lead to inaccurate model estimates.
Interpreting the ARIMA Model and Forecast Results: After fitting the ARIMA model, the forecast results
give predicted values along with prediction intervals. These forecasts help understand how the number of passengers
is expected to grow in the future. By examining the residuals, we ensure that the ARIMA model captures the
significant patterns of the data while leaving only white noise in the residuals.
13.6 Include Steps to Check for Stationarity, Select Model Parameters, and Evaluate
the Model’s Forecasting Accuracy
Stationarity Check: One of the key steps in time series analysis is ensuring that the data is stationary. This means
that the mean, variance, and autocovariance of the series do not change over time. If the data is non-stationary, it
needs to be transformed. In the ARIMA model, differencing is used to make the data stationary.
We check for stationarity using visual inspection of the time series and statistical tests like the Augmented Dickey-
Fuller (ADF) test. If the p-value of the test is below a certain threshold (typically 0.05), the series is considered
stationary.
# adf.test() is provided by the tseries package
library(tseries)

# Perform ADF test
adf_test <- adf.test(log(ts_data), alternative = "stationary")
print(adf_test)
Listing 40: Stationarity Check using ADF Test
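The differencing step mentioned above can be illustrated in base R. The sketch assumes one seasonal difference at lag 12 followed by one regular difference, which corresponds to the d = 1, D = 1 orders of the model fitted earlier; the names `log_series`, `seasonal_diff`, and `stationary_series` are illustrative.

```r
log_series <- log(AirPassengers)

# Seasonal difference: subtract the value from 12 months earlier
seasonal_diff <- diff(log_series, lag = 12)

# Regular difference: subtract the previous month's value
stationary_series <- diff(seasonal_diff)

# Each differencing step shortens the series
print(c(original = length(log_series),
        after_seasonal = length(seasonal_diff),
        after_both = length(stationary_series)))

# The differenced series should fluctuate around zero with no visible trend
print(mean(stationary_series))
plot(stationary_series, main = "Differenced Log Series",
     ylab = "Differenced Log of Values")
```

Plotting the differenced series is a quick visual check that the trend and the yearly pattern have been removed.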
Model Parameter Selection: The ‘[Link]‘ function in R automatically selects the best combination of
AR, I, and MA components based on the Akaike Information Criterion (AIC). However, in some cases, manual tuning
of the parameters may be necessary.
# Fit ARIMA model manually
manual_fit <- arima(log(ts_data), order = c(2, 1, 2))
summary(manual_fit)
Listing 41: Manual ARIMA Model Selection
Model Evaluation: To evaluate the accuracy of the ARIMA model, we can use several methods such as residual
diagnostics and forecasting accuracy measures like Mean Absolute Error (MAE) or Root Mean Square Error (RMSE).
# Plot residuals
checkresiduals(fit)

# Calculate accuracy
accuracy(forecast_data)
Expected Output:
• A plot of the residuals should show no significant patterns, indicating that the ARIMA model has adequately
captured the underlying structure.
• Forecast accuracy results, which include metrics such as MAE, RMSE, and Mean Absolute Percentage Error
(MAPE).
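One common way to evaluate forecasting accuracy is to hold out the last part of the series and compare forecasts against it. The sketch below is a base R illustration and not the listing above: the 1959 split point, the ARIMA(0,1,1)(0,1,1)[12] orders, the Ljung-Box lag of 24, and all variable names are assumptions made for the example.

```r
# Hold out the final two years as a test set
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

fit <- arima(log(train),
             order    = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Forecast the held-out period and back-transform to passenger counts
pred <- exp(predict(fit, n.ahead = length(test))$pred)
errors <- as.numeric(test) - as.numeric(pred)

mae  <- mean(abs(errors))                           # Mean Absolute Error
rmse <- sqrt(mean(errors^2))                        # Root Mean Square Error
mape <- mean(abs(errors / as.numeric(test))) * 100  # MAPE, in percent
print(c(MAE = mae, RMSE = rmse, MAPE = mape))

# Ljung-Box test on the training residuals: a large p-value is
# consistent with white noise (fitdf = number of estimated coefficients)
lb <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)
print(lb)
```

Errors measured on held-out data are usually larger than in-sample errors, and they give a more honest picture of how the model behaves on future observations.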
Detailed Interpretation of Time Series Components and Forecast Results:
The decomposition of time series data reveals important components:
• Trend: The long-term movement in the data, which shows the increasing or decreasing pattern over time.
• Seasonality: The repeating short-term cycles observed in the data at regular intervals.
• Residuals: The noise or irregular component left after accounting for the trend and seasonality.
By analyzing these components, we can better understand the data’s behavior. The ARIMA model’s future
forecasts provide valuable insights into the expected growth or decline of the series, which can be used for decision-
making in areas like demand forecasting, resource allocation, and financial planning.
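The three components can be extracted with the base R function ‘decompose‘. The sketch below assumes a multiplicative decomposition of the AirPassengers series, a choice made here because the seasonal swings grow with the level of the series; the name `components` is illustrative.

```r
# Classical decomposition into trend, seasonal, and random components
components <- decompose(AirPassengers, type = "multiplicative")

# Trend: centered moving average (NA for the first and last six months)
print(head(components$trend, 12))

# Seasonality: one index per calendar month, averaging to 1
print(round(components$figure, 3))

# Residuals: what remains after removing trend and seasonality
print(summary(as.numeric(components$random)))

# Four-panel plot: observed, trend, seasonal, random
plot(components)
```

A seasonal index above 1 marks a month that is typically busier than the trend level, and an index below 1 marks a quieter month.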
# ...
# Load dataset
data(iris)

# ...
# Show plot
scatter_plot
Listing 45: Interactive Scatter Plot
Expected Output: The interactive scatter plot displays Sepal Length on the x-axis and Sepal Width on the
y-axis. Different species are represented by different colors, and users can hover over data points to see additional
information.
# ...
# Load dataset
data("AirPassengers")

# ...
# Show plot
line_chart
# ...
# Load dataset
data(mtcars)

# ...
# Show plot
bar_chart
Listing 47: Interactive Bar Chart
Expected Output: The interactive bar chart displays the miles per gallon for different car models. Users can
hover over bars to see exact values and scroll through car models.
# ...
# Load dataset
data(iris)

# ...
# Show plot
custom_scatter_plot
Listing 48: Customized Interactive Scatter Plot
Expected Output: The customized scatter plot features larger, semi-transparent markers and displays detailed
tooltips. The mode bar, which includes zoom and pan buttons, is partially removed for a cleaner interface.
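A customized plot of the kind described above could be built as in the following sketch. It assumes the ‘plotly‘ package is installed; the marker size, the opacity, the tooltip wording, and the two mode bar buttons removed are illustrative choices rather than the values used in the original listing.

```r
library(plotly)

# Scatter plot with larger, semi-transparent markers and custom tooltips
custom_scatter_plot <- plot_ly(
  data = iris,
  x = ~Sepal.Length,
  y = ~Sepal.Width,
  color = ~Species,
  type = "scatter",
  mode = "markers",
  marker = list(size = 12, opacity = 0.6),
  text = ~paste("Species:", Species,
                "<br>Sepal Length:", Sepal.Length,
                "<br>Sepal Width:", Sepal.Width),
  hoverinfo = "text"   # show only the custom tooltip text
)

# Axis titles and plot title
custom_scatter_plot <- layout(
  custom_scatter_plot,
  title = "Customized Interactive Scatter Plot",
  xaxis = list(title = "Sepal Length"),
  yaxis = list(title = "Sepal Width")
)

# Remove the zoom and pan buttons from the mode bar
custom_scatter_plot <- plotly::config(
  custom_scatter_plot,
  modeBarButtonsToRemove = c("zoom2d", "pan2d")
)
```

Typing `custom_scatter_plot` in an interactive session displays the widget in the RStudio viewer or a browser.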
14.1.6 Conclusion
Interactive visualizations using the ‘plotly‘ package in R offer an engaging way to explore and present data. By
creating interactive scatter plots, line charts, and bar charts, users can dynamically interact with their data, uncover
insights, and make informed decisions. The ability to customize interactivity further enhances the user experience
and effectiveness of these visualizations.
14.2 Figures
Below are examples of the interactive plots created using the ‘plotly‘ package:
# ...
# Load dataset
data(iris)

# ...
# Show plot
scatter_tooltip
Listing 49: Scatter Plot with Tooltips and Hover Effects
In this example, when the user hovers over any point on the scatter plot, a tooltip appears showing the species and
the corresponding Sepal Length and Width values. This allows users to explore data points interactively without
adding clutter to the chart.
# ...
# Show plot
scatter_legend
Listing 50: Scatter Plot with Interactive Legend
In this plot, users can click on each species in the legend to hide or show the corresponding data points, offering a
flexible way to focus on specific parts of the data.
# ...
# Load dataset
data("AirPassengers")

# ...
# Show plot
line_custom_hover
Listing 51: Line Chart with Custom Hover Effects
In this plot, the hover effect displays only the x (Month) and y (Passengers) values. This is achieved by setting
‘hoverinfo‘ to ‘x+y‘, simplifying the information shown when hovering over data points.
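The hover setting described above could be applied as in the sketch below. It assumes the ‘plotly‘ package is installed and converts the AirPassengers series to a data frame first; the data frame name `air_df` and the axis titles are illustrative.

```r
library(plotly)

# Convert the time series to a data frame with one row per month
air_df <- data.frame(
  Month      = as.numeric(time(AirPassengers)),  # decimal year
  Passengers = as.numeric(AirPassengers)
)

# Line chart whose tooltip shows only the x and y values
line_custom_hover <- plot_ly(
  data = air_df,
  x = ~Month,
  y = ~Passengers,
  type = "scatter",
  mode = "lines",
  hoverinfo = "x+y"
)

line_custom_hover <- layout(
  line_custom_hover,
  title = "AirPassengers",
  xaxis = list(title = "Month"),
  yaxis = list(title = "Passengers")
)
```

Typing `line_custom_hover` in an interactive session displays the chart.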
– Users can customize the interactivity of their visualizations, allowing for better tailoring of plots for
different audiences.
– Interactive legends (Figure 34) allow users to selectively view or hide data series, making it easier to focus
on relevant parts of the data.
– Interactive tools facilitate the discovery of patterns and relationships that might not be immediately
obvious in static plots.
– For example, scatter plots with hover effects can reveal clusters or outliers that would otherwise go
unnoticed.
• Increased Engagement:
– Interactive visualizations increase user engagement by offering a hands-on approach to data exploration.
– Whether used in educational settings or business presentations, direct interaction with data can foster
curiosity and promote deeper learning or understanding.
Another advantage of RMarkdown is the ease with which it allows data to be communicated clearly. By combining
code, narrative, and visualizations in one document, you can effectively convey complex insights to both technical
and non-technical audiences. Moreover, the interactive features provided in the HTML format (such as expandable
code blocks and dynamic visualizations) make it highly engaging for users.
To illustrate, consider a case where a data analyst wants to document their findings while ensuring others can follow
and replicate the process. By using RMarkdown, they can document their analysis steps (e.g., data cleaning, manip-
ulation, statistical testing) alongside the R code and output. This helps make the workflow clear and transparent,
ensuring credibility and fostering collaboration.
Overall, RMarkdown is a valuable tool in the world of data analysis because it simplifies the process of creating
reports, allows for real-time updates, and ensures transparency and reproducibility in data analysis workflows.
As the image above shows, a title and an author have been filled in, and the output format can be switched between
HTML, PDF, and Word. Explore this window and the tabs along the left to see all the different formats it can output
to. When this is completed, click OK, and a new window should open with a short explanation of R Markdown files.
15.5 YAML
The YAML header contains the metadata of the R Markdown document. Begin and end the header with a line of three
dashes (---). You can change the information in this section at any time by adding text or by overriding the current text.
The output value specifies which type of file will be built from your .Rmd file.
RStudio automatically adds this formatted default code chunk to the notebook. A code chunk starts with the
delimiter ```{r} and ends with ```.
There are two ways to add code chunks to an R Markdown document: you can press Ctrl + Alt + I (on Windows)
or Cmd + Option + I (on Mac), or you can use the Add Chunk command in the editor toolbar. In the default
code section we find “knitr”, an R package with lightweight APIs designed to give users full control of the
output format; it is used whenever you render your R Markdown document. The “knitr” package provides
different chunk options.
The Knit drop-down menu includes three main options: HTML, PDF, and Word document. You can use Knit to
convert your file to any of these types. When you render your file, you can preview how it will look in the format
you selected. Knitting executes each code chunk, inserts the results into your report, and saves the output file in
your working directory.
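To tie the pieces together, the sketch below writes a minimal R Markdown file from R: a YAML header between two lines of three dashes, a line of narrative, and one code chunk. The title, author, chunk label, and file name are placeholders, and the rendering call is left commented out because it assumes the ‘rmarkdown‘ package and pandoc are available.

```r
# Three backticks, the delimiter that opens and closes a code chunk
fence <- strrep("`", 3)

rmd_lines <- c(
  "---",                                  # start of the YAML header
  'title: "Example Report"',              # placeholder title
  'author: "Your Name"',                  # placeholder author
  "output: html_document",                # file type built when knitting
  "---",                                  # end of the YAML header
  "",
  "This report summarises the built-in cars data set.",
  "",
  paste0(fence, "{r cars-summary, echo=TRUE}"),  # chunk header with a knitr option
  "summary(cars)",
  fence                                   # end of the chunk
)

# Write the document to a temporary location
rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_path)
cat(readLines(rmd_path), sep = "\n")

# Equivalent of pressing Knit (requires rmarkdown and pandoc):
# rmarkdown::render(rmd_path, output_format = "html_document")
```

Changing the `output` line to `pdf_document` or `word_document` selects one of the other formats offered in the Knit menu.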